EdgeR for proteomics

0

Entering edit mode

Fabricio Marchini ▴ 20

@fabricio-marchini-5543

Last seen 10.6 years ago

Hi, I'm using EdgeR to analyse a proteomic data with peptide counting. I have limited experience on R/EdgeR/Statistics so I appreciate some help. Using the follow code: a=file[,2:64] b=DGEList(counts=a,group=rep(c("La6h","La24h","Lm6h","Lm24h","MO6h","M O24h" ),c(10,11,10,11,10,11)), lib.size=colSums(a)) b <- calcNormFactors(b) times <- rep(c("La6h","La24h","Lm6h","Lm24h","MO6h","MO24h"),c(10,11,10,11, 10,11)) times <- factor(times,levels=c("La6h","La24h","Lm6h","Lm24h","MO6h","MO24h" )) design <- model.matrix(~factor(times)) disp <- estimateCommonDisp(b) fit <- glmFit(b,design,dispersion=disp$common.dispersion) lrt <- glmLRT(b,fit,coef=fit$design) disp$common.dispersion = 0.0001004979 All proteins (3430) had a p.value of 0. I tried also with fit <- glmFit(b,design,dispersion=disp$common.dispersion) lrt <- glmLRT(b,fit,coef=fit$design) disp$common.dispersion = 3.999943 and that gave me all the proteins with p.value lower than 6.29E-05. That gave a signal that I'm doing something wrong or because of both common dispersions my data is not a appropriate for the analysis. Any suggestions or corrections? -- Fabricio K. Marchini [[alternative HTML version deleted]]

edgeR edgeR • 2.2k views

ADD COMMENT • link updated 12.5 years ago by Mark Robinson ▴ 880 • written 12.5 years ago by Fabricio Marchini ▴ 20

0

Entering edit mode

Mark Robinson ▴ 880

@mark-robinson-4908

Last seen 6.4 years ago

Hi Fabricio, I suggest you check (at least) 2 things: 1. > disp <- estimateCommonDisp(b) > disp$common.dispersion = 0.0001004979 > disp$common.dispersion = 3.999943 Your example only makes 1 call to estimateCommonDisp(), but you have 2 drastically different values. Are you reporting these as the estimated values, or are you actually running this command and *setting* the common dispersion? It's not clear from your message. You may also want to study some of the GLM-based case studies in: http://www.bioconductor.org/packages/2.11/bioc/vignettes/edgeR/inst/do c/edgeRUsersGuide.pdf For example, the standard GLM work flow would be similar to that on Page 7. 2. > lrt <- glmLRT(b,fit,coef=fit$design) The docs for the 'coef' argument (?glmLRT) say: ---- coef: integer or character vector indicating which coefficients of the linear model are to be tested equal to zero. Values must be columns or column names of ?design?. Defaults to the last coefficient. Ignored if ?contrast? is specified. ---- As you can see, the function is expecting something very different to what give as your 'coef' argument. Maybe you want 'coef=2:6', if you are looking for any difference between your 6 groups. Of course, maybe you actually want to split your factors into 2 ? one of ("La","Lm","MO") and one of ("6h","24h") and construct a design matrix accordingly. But, this is also not clear from your message. Hope that helps, Mark ---------- Prof. Dr. Mark Robinson Bioinformatics Institute of Molecular Life Sciences University of Zurich Winterthurerstrasse 190 8057 Zurich Switzerland v: +41 44 635 4848 f: +41 44 635 6898 e: mark.robinson at imls.uzh.ch o: Y11-J-16 w: http://tiny.cc/mrobin ---------- http://www.fgcz.ch/Bioconductor2012 On 10.10.2012, at 06:41, Fabricio Marchini wrote: > Hi, > > I'm using EdgeR to analyse a proteomic data with peptide counting. I have > limited experience on R/EdgeR/Statistics so I appreciate some help. > Using the follow code: > > a=file[,2:64] > > b=DGEList(counts=a,group=rep(c("La6h","La24h","Lm6h","Lm24h","MO6h", "MO24h" > ),c(10,11,10,11,10,11)), lib.size=colSums(a)) > > b <- calcNormFactors(b) > > times <- rep(c("La6h","La24h","Lm6h","Lm24h","MO6h","MO24h"),c(10,11,10,11, > 10,11)) > > times <- factor(times,levels=c("La6h","La24h","Lm6h","Lm24h","MO6h","MO24h" > )) > > design <- model.matrix(~factor(times)) > > disp <- estimateCommonDisp(b) > > fit <- glmFit(b,design,dispersion=disp$common.dispersion) > > lrt <- glmLRT(b,fit,coef=fit$design) > disp$common.dispersion = 0.0001004979 > > All proteins (3430) had a p.value of 0. > > I tried also with > > fit <- glmFit(b,design,dispersion=disp$common.dispersion) > > lrt <- glmLRT(b,fit,coef=fit$design) > disp$common.dispersion = 3.999943 > > and that gave me all the proteins with p.value lower than 6.29E-05. > > That gave a signal that I'm doing something wrong or because of both common > dispersions my data is not a appropriate for the analysis. > > Any suggestions or corrections? > > -- > Fabricio K. Marchini > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 12.5 years ago Mark Robinson ▴ 880

0

Entering edit mode

Thank you Mark, My absence of clearness is because I'm having trouble to figure out how to deal with the data. at your first question, actually there is a missing command line: > disp <- estimateCommonDisp(b) > disp$common.dispersion = 0.0001004979 *> disp <- estimateGLMCommonDisp(b)* > disp$common.dispersion = 3.999943 I was doing some tests about dispersions, I'm still not certain wich one should I use. The second question: I have 2 comtrols: MO6h MO24h (these are my controls through time) and 2 treatments for both times: La6h La24h (La treatment) Lm6h Lm24h (Lm treatment) My biological questions are, 1) do I have modulation for each treatment through time and 2) there are differences for the same time for both treatments? I do not know how should I use my controls for both situations. For the situation that I'm looking for each time point I could use a intercept design for each time point, like: design <- model.matrix(~group, data=c$samples) But I do not know if that is correct and how to analyse each treatment dealing with the controls. The best situation would be to deal first with the controls and them the treated samples through time, any ideia? -- Fabricio K. Marchini Adjunct Researcher Bioinformatics and Computational Biology Laboratory Functional Genomics Laboratory CV - CNPq <http: lattes.cnpq.br="" 4462407232591810=""> / Web site<http: marchinifk.me=""> Instituto Carlos Chagas / FIOCRUZ-PR Phone +55 41 3316 3236 Fax +55 41 3316 3267 R. Prof. Algacyr Munhoz Mader, 3775 -CIC Curitiba - PR - Brasil, 81350-010 On Wed, Oct 10, 2012 at 5:40 PM, Mark Robinson <mark.robinson@imls.uzh.ch>wrote: > Hi Fabricio, > > I suggest you check (at least) 2 things: > > 1. > > disp <- estimateCommonDisp(b) > > disp$common.dispersion = 0.0001004979 > > > disp$common.dispersion = 3.999943 > > Your example only makes 1 call to estimateCommonDisp(), but you have 2 > drastically different values. Are you reporting these as the estimated > values, or are you actually running this command and *setting* the common > dispersion? It's not clear from your message. > > You may also want to study some of the GLM-based case studies in: > > http://www.bioconductor.org/packages/2.11/bioc/vignettes/edgeR/inst/ doc/edgeRUsersGuide.pdf > > For example, the standard GLM work flow would be similar to that on Page 7. > > > 2. > > lrt <- glmLRT(b,fit,coef=fit$design) > > The docs for the 'coef' argument (?glmLRT) say: > ---- > coef: integer or character vector indicating which coefficients of > the linear model are to be tested equal to zero. Values must > be columns or column names of design. Defaults to the last > coefficient. Ignored if contrast is specified. > ---- > As you can see, the function is expecting something very different to what > give as your 'coef' argument. Maybe you want 'coef=2:6', if you are looking > for any difference between your 6 groups. Of course, maybe you actually > want to split your factors into 2 one of ("La","Lm","MO") and one of > ("6h","24h") and construct a design matrix accordingly. But, this is also > not clear from your message. > > Hope that helps, > Mark > > > ---------- > Prof. Dr. Mark Robinson > Bioinformatics > Institute of Molecular Life Sciences > University of Zurich > Winterthurerstrasse 190 > 8057 Zurich > Switzerland > > v: +41 44 635 4848 > f: +41 44 635 6898 > e: mark.robinson@imls.uzh.ch > o: Y11-J-16 > w: http://tiny.cc/mrobin > > ---------- > http://www.fgcz.ch/Bioconductor2012 > > > > On 10.10.2012, at 06:41, Fabricio Marchini wrote: > > > Hi, > > > > I'm using EdgeR to analyse a proteomic data with peptide counting. I have > > limited experience on R/EdgeR/Statistics so I appreciate some help. > > Using the follow code: > > > > a=file[,2:64] > > > > > b=DGEList(counts=a,group=rep(c("La6h","La24h","Lm6h","Lm24h","MO6h", "MO24h" > > ),c(10,11,10,11,10,11)), lib.size=colSums(a)) > > > > b <- calcNormFactors(b) > > > > times <- > rep(c("La6h","La24h","Lm6h","Lm24h","MO6h","MO24h"),c(10,11,10,11, > > 10,11)) > > > > times <- > factor(times,levels=c("La6h","La24h","Lm6h","Lm24h","MO6h","MO24h" > > )) > > > > design <- model.matrix(~factor(times)) > > > > disp <- estimateCommonDisp(b) > > > > fit <- glmFit(b,design,dispersion=disp$common.dispersion) > > > > lrt <- glmLRT(b,fit,coef=fit$design) > > disp$common.dispersion = 0.0001004979 > > > > All proteins (3430) had a p.value of 0. > > > > I tried also with > > > > fit <- glmFit(b,design,dispersion=disp$common.dispersion) > > > > lrt <- glmLRT(b,fit,coef=fit$design) > > disp$common.dispersion = 3.999943 > > > > and that gave me all the proteins with p.value lower than 6.29E-05. > > > > That gave a signal that I'm doing something wrong or because of both > common > > dispersions my data is not a appropriate for the analysis. > > > > Any suggestions or corrections? > > > > -- > > Fabricio K. Marchini > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > [[alternative HTML version deleted]]

ADD REPLY • link 12.5 years ago Fabricio Marchini ▴ 20

0

Entering edit mode

Dear Fabricio, A couple quick comments below ? On 15.10.2012, at 17:41, Fabricio Marchini wrote: > Thank you Mark, > > My absence of clearness is because I'm having trouble to figure out how to deal with the data. > > at your first question, actually there is a missing command line: > > > disp <- estimateCommonDisp(b) > > disp$common.dispersion = 0.0001004979 > > > disp <- estimateGLMCommonDisp(b) > > disp$common.dispersion = 3.999943 Read the docs for estimateGLMCommonDisp -- ?estimateGLMCommonDisp You'll want to specify your design matrix here in the call the estimateGLMCommonDisp(), otherwise it calculates dispersion relative to just an intercept model (i.e. all groups have same mean). That is probably why your estimate is so high. Do you have real biological replicates here? Given the very low dispersion with estimateCommonDisp(), I would guess no. In that case, read the edgeR users guide -- 2.9 What to do if you have no replicates. > I was doing some tests about dispersions, I'm still not certain wich one should I use. > > The second question: > > I have 2 comtrols: > MO6h MO24h (these are my controls through time) > and 2 treatments for both times: > La6h La24h (La treatment) > Lm6h Lm24h (Lm treatment) > > My biological questions are, 1) do I have modulation for each treatment through time and 2) there are differences for the same time for both treatments? > I do not know how should I use my controls for both situations. For the situation that I'm looking for each time point I could use a intercept design for each time point, like: > > design <- model.matrix(~group, data=c$samples) The comment from my previous message still applies -- split your variables into treatment and time and create a design matrix (and maybe contrast matrix) according to the questions you want to answer. Have a look at the limma/edgeR user's guide for some examples, or ask a local statistician. Best regards, Mark > > But I do not know if that is correct and how to analyse each treatment dealing with the controls. The best situation would be to deal first with the controls and them the treated samples through time, any ideia? > > -- > Fabricio K. Marchini > Adjunct Researcher > > Bioinformatics and Computational Biology Laboratory > Functional Genomics Laboratory > CV - CNPq / Web site > > Instituto Carlos Chagas / FIOCRUZ-PR > Phone +55 41 3316 3236 > Fax +55 41 3316 3267 > R. Prof. Algacyr Munhoz Mader, 3775 -CIC > Curitiba - PR - Brasil, 81350-010 > > > On Wed, Oct 10, 2012 at 5:40 PM, Mark Robinson <mark.robinson at="" imls.uzh.ch=""> wrote: > Hi Fabricio, > > I suggest you check (at least) 2 things: > > 1. > > disp <- estimateCommonDisp(b) > > disp$common.dispersion = 0.0001004979 > > > disp$common.dispersion = 3.999943 > > Your example only makes 1 call to estimateCommonDisp(), but you have 2 drastically different values. Are you reporting these as the estimated values, or are you actually running this command and *setting* the common dispersion? It's not clear from your message. > > You may also want to study some of the GLM-based case studies in: > http://www.bioconductor.org/packages/2.11/bioc/vignettes/edgeR/inst/ doc/edgeRUsersGuide.pdf > > For example, the standard GLM work flow would be similar to that on Page 7. > > > 2. > > lrt <- glmLRT(b,fit,coef=fit$design) > > The docs for the 'coef' argument (?glmLRT) say: > ---- > coef: integer or character vector indicating which coefficients of > the linear model are to be tested equal to zero. Values must > be columns or column names of ?design?. Defaults to the last > coefficient. Ignored if ?contrast? is specified. > ---- > As you can see, the function is expecting something very different to what give as your 'coef' argument. Maybe you want 'coef=2:6', if you are looking for any difference between your 6 groups. Of course, maybe you actually want to split your factors into 2 ? one of ("La","Lm","MO") and one of ("6h","24h") and construct a design matrix accordingly. But, this is also not clear from your message. > > Hope that helps, > Mark > > > ---------- > Prof. Dr. Mark Robinson > Bioinformatics > Institute of Molecular Life Sciences > University of Zurich > Winterthurerstrasse 190 > 8057 Zurich > Switzerland > > v: +41 44 635 4848 > f: +41 44 635 6898 > e: mark.robinson at imls.uzh.ch > o: Y11-J-16 > w: http://tiny.cc/mrobin > > ---------- > http://www.fgcz.ch/Bioconductor2012 > > > > On 10.10.2012, at 06:41, Fabricio Marchini wrote: > > > Hi, > > > > I'm using EdgeR to analyse a proteomic data with peptide counting. I have > > limited experience on R/EdgeR/Statistics so I appreciate some help. > > Using the follow code: > > > > a=file[,2:64] > > > > b=DGEList(counts=a,group=rep(c("La6h","La24h","Lm6h","Lm24h","MO6h ","MO24h" > > ),c(10,11,10,11,10,11)), lib.size=colSums(a)) > > > > b <- calcNormFactors(b) > > > > times <- rep(c("La6h","La24h","Lm6h","Lm24h","MO6h","MO24h"),c(10,11,10,11, > > 10,11)) > > > > times <- factor(times,levels=c("La6h","La24h","Lm6h","Lm24h","MO6h","MO24h" > > )) > > > > design <- model.matrix(~factor(times)) > > > > disp <- estimateCommonDisp(b) > > > > fit <- glmFit(b,design,dispersion=disp$common.dispersion) > > > > lrt <- glmLRT(b,fit,coef=fit$design) > > disp$common.dispersion = 0.0001004979 > > > > All proteins (3430) had a p.value of 0. > > > > I tried also with > > > > fit <- glmFit(b,design,dispersion=disp$common.dispersion) > > > > lrt <- glmLRT(b,fit,coef=fit$design) > > disp$common.dispersion = 3.999943 > > > > and that gave me all the proteins with p.value lower than 6.29E-05. > > > > That gave a signal that I'm doing something wrong or because of both common > > dispersions my data is not a appropriate for the analysis. > > > > Any suggestions or corrections? > > > > -- > > Fabricio K. Marchini > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >

ADD REPLY • link 12.5 years ago Mark Robinson ▴ 880

Login before adding your answer.