edgeR design matrix, one group vs average of other groups

0

Georg Otto ▴ 120

@georg-otto-6333

Last seen 6.0 years ago

United Kingdom

Dear Bioconductors,

I am working on RNA-seq data with multiple experimental factors and I am trying to reproduce the GLM approach from chapter 3.2.3 of the edgeR manual.

> design <- model.matrix(~0+group, data=y$samples)
> colnames(design) <- levels(y$samples$group)
> design
         A B C
sample.1 1 0 0
sample.2 1 0 0
sample.3 0 1 0
sample.4 0 1 0
sample.5 0 0 1
> fit <- glmFit(y, design)

I want to know which genes are differentially expressed in C compared to the other groups, so I chose to compare C to the average of A and B:

> lrt <- glmLRT(fit, contrast=c(-0.5,-0.5,1))

Alternatively, I could put A and B in a single group

> design
         A.B C
sample.1   1 0
sample.2   1 0
sample.3   1 0
sample.4   1 0
sample.5   0 1
> fit <- glmFit(y, design)

and compare C to A.B:

> lrt <- glmLRT(fit, contrast=c(-1,1))

When I try this with my own data, the first approach gives me many more differentially expressed genes than the second one, but the second gene set is a subset of the first one. I would be very grateful if somebody could explain the difference between the two approaches, and which one is more appropriate for my purpose (finding genes specific to condition C).

Best wishes,
Georg

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] limma_3.18.13

loaded via a namespace (and not attached):
[1] compiler_3.0.1 tools_3.0.1

edgeR edgeR • 4.1k views

ADD COMMENT • link updated 10.7 years ago by Gordon Smyth 52k • written 10.7 years ago by Georg Otto ▴ 120

1

Yunshun Chen ▴ 900

@yunshun-chen-5451

Last seen 5 days ago

Australia

Dear Georg,

I think the first model is more appropriate.

In your second model, the deviance under the alternative is larger than it should be, since the difference between A and B is not accounted for. That makes the LR statistics smaller than they should be, which results in fewer DE genes.

By the way, the dispersions have to be re-estimated if a different design matrix is used. I cannot tell from the given code whether you have done that.

Regards,
Yunshun Chen
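The deviance argument can be illustrated on a single gene. The following is only a sketch under stated assumptions: it uses a plain Poisson GLM from base R as a stand-in for edgeR's negative binomial fit, and the counts are invented for illustration.

```r
# Toy counts for one gene across the five samples; values are made up.
counts <- c(10, 12, 40, 44, 90)
group  <- factor(c("A", "A", "B", "B", "C"))

# Design 1: A, B and C each get their own coefficient.
fit_full <- glm(counts ~ 0 + group, family = poisson())

# Design 2: A and B are merged into a single group before fitting.
pooled     <- factor(ifelse(group == "C", "C", "A.B"))
fit_pooled <- glm(counts ~ 0 + pooled, family = poisson())

# The pooled model is nested in the full model, so its residual deviance
# cannot be smaller: the A-versus-B difference is left unexplained.
dev <- c(full = deviance(fit_full), pooled = deviance(fit_pooled))
print(dev)
```

In an edgeR analysis, that unexplained A-versus-B variation would typically show up as a larger dispersion estimate once the dispersions are re-estimated under the pooled design, which is consistent with fewer genes reaching significance.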

ADD COMMENT • link 10.7 years ago Yunshun Chen ▴ 900

0

Hi Yunshun and Georg,

I have a dataset with a very similar situation. It is a lung cancer dataset with tumors and normals, and we hypothesize that the tumors fall into a few subtypes. I am looking for differences between the tumor subtypes (e.g., tumor type 1 vs tumor type 2), as well as tumor type 1 vs normal type 1, tumor type 2 vs normal type 2, and so on, but I am also interested in the overall tumor vs normal contrast. Since I want to assess everything under one roof, I set up the following in the makeContrasts function, like Georg's first model:

tumor_vs_normal = 0.5*(tumor type 1 + tumor type 2) - 0.5*(normal type 1 + normal type 2)

Alternatively, I could do what Georg did in his second model: ignore the tumor subtypes and simply use tumor - normal. This is what is generally done in the field, since many people do not know much about the subtypes or want to study the tumor as a whole.

As you pointed out, the second model ignores the subtypes and does not account for the difference between A and B (in my case, between type 1 and type 2). In other words, the variance between the two tumor subtypes is treated as part of the "common" variance among tumors overall. You also mentioned that this makes the LR statistics smaller than they should be, which results in fewer DE genes in the second case. So, to my understanding, the averaged contrast in the first model yields more DEGs than the simple tumor - normal contrast because some of the internal variance among tumors is now attributed to the difference between the two subtypes, and the first model therefore picks up genes that the second model misses.

Now my question is: what do these "extra" genes represent? Do they represent DEGs between tumor and normal overall, and also the difference between the subtypes?

The reason I am asking is that I am looking for genes that separate the tumor subtypes well and also separate tumor from normal overall. My approach is to find the DEGs between subtypes (e.g., tumor type 1 vs tumor type 2) and the general DEGs from the averaged tumor_vs_normal contrast above, and then take the genes common to the two sets. It turns out that the common genes do not seem able to separate the two tumor subtypes and tumor vs normal at the same time. One reason may be that the overall tumor vs normal difference is so strong that it weakens the difference between the two subtypes. Also, the tumor_vs_normal DEGs I used come from the averaged contrast rather than from a simple tumor - normal contrast in the GLM.

So for my purpose (finding genes that separate the tumor subtypes well and also separate tumor from normal overall), what would be the best way to use these DEGs?

Sorry, this is a bit off the original topic, but it seemed relevant to how to set up the model for deriving DEGs for a variety of purposes, so I thought it was worth posting here for discussion.

Thanks in advance for your insightful opinions!

Best,
Ming
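For concreteness, the contrasts described in this post can be written out as plain coefficient vectors. This is a minimal sketch in base R; the group labels T1, T2, N1 and N2 are hypothetical stand-ins for the two tumor subtypes and their matched normals, and the column order of the real design matrix is an assumption.

```r
# Hypothetical group labels; the real design matrix columns may differ.
groups <- c("T1", "T2", "N1", "N2")

# One column per contrast, one row per group coefficient.
contrasts <- cbind(
  tumor_vs_normal = c(0.5, 0.5, -0.5, -0.5),  # average tumor minus average normal
  T1_vs_T2        = c(1,  -1,    0,    0),    # difference between the subtypes
  T1_vs_N1        = c(1,   0,   -1,    0),
  T2_vs_N2        = c(0,   1,    0,   -1)
)
rownames(contrasts) <- groups

# Every contrast compares groups against each other, so its weights sum to zero.
print(contrasts)
print(colSums(contrasts))
```

limma's makeContrasts() builds the same kind of matrix from expressions, and a single column (for example `contrasts[, "tumor_vs_normal"]`) is the form that the `contrast` argument of glmLRT() expects, provided the rows match the order of the design matrix columns.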

ADD REPLY • link 10.7 years ago Ming ▴ 380

1

Gordon Smyth 52k

@gordon-smyth

Last seen 13 hours ago

WEHI, Melbourne, Australia

Dear Georg,

It makes no difference how many samples you have in each group (and why should it?). The choice of test depends on the scientific questions you wish to answer, not on technical aspects of your dataset.

The only reason that you might combine A and B would be if you specifically wanted to find genes that are the *same* in A and B but different in C. From what you have said, that is not what you want.

Best wishes
Gordon
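One way to see why group sizes need not drive the choice is to look at what each comparison estimates. The sketch below is an illustration, not part of the answer above: it uses an ordinary linear model on noise-free, made-up log-expression values instead of an edgeR fit, with the 6/2/1 sample imbalance from the follow-up question.

```r
# Noise-free toy values: group means are A = 1, B = 3, C = 5 (invented numbers).
group <- factor(rep(c("A", "B", "C"), times = c(6, 2, 1)))
expr  <- unname(c(A = 1, B = 3, C = 5)[as.character(group)])

# Three-group design with the averaging contrast c(-0.5, -0.5, 1).
fit3     <- lm(expr ~ 0 + group)
c_vs_avg <- sum(c(-0.5, -0.5, 1) * coef(fit3))

# Pooled design: A and B merged into one group, contrast c(-1, 1).
pooled      <- factor(ifelse(group == "C", "C", "A.B"))
fit2        <- lm(expr ~ 0 + pooled)
c_vs_pooled <- sum(c(-1, 1) * coef(fit2))

# The averaging contrast weights A and B equally whatever their sample sizes;
# the pooled comparison is pulled towards the larger group (here A).
print(c(averaged = c_vs_avg, pooled = c_vs_pooled))
```

With these toy values the averaged contrast compares C with the midpoint of the A and B means, while the pooled comparison uses a sample-weighted mean dominated by A. Which quantity is of interest is a scientific question rather than a consequence of how many samples each group happens to contain.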

ADD COMMENT • link 10.7 years ago Gordon Smyth 52k

0

ying chen ▴ 340

@ying-chen-5085

Last seen 10.2 years ago

Hi guys,

I tried to draw plots similar to cBioPortal's OncoPrint plot, but have had no success yet. Here is an example of the OncoPrint plot:

http://www.cbioportal.org/public-portal/index.do?cancer_study_id=luad_tcga&genetic_profile_ids_PROFILE_MUTATION_EXTENDED=luad_tcga_mutations&genetic_profile_ids_PROFILE_COPY_NUMBER_ALTERATION=luad_tcga_gistic&genetic_profile_ids_PROFILE_MRNA_EXPRESSION=luad_tcga_rna_seq_v2_mrna_median_Zscores&Z_SCORE_THRESHOLD=2.0&RPPA_SCORE_THRESHOLD=2.0&data_priority=0&case_set_id=luad_tcga_all&case_ids=&gene_set_choice=user-defined-list&gene_list=FGFR3&clinical_param_selection=null&tab_index=tab_visualize&Action=Submit

Any suggestions?

Thanks,
Ying

ADD COMMENT • link 10.7 years ago ying chen ▴ 340

0

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 7 weeks ago

Icahn School of Medicine at Mount Sinai…

Hi Georg,

Gordon Smyth gave quite a comprehensive answer to this and similar issues a little while ago in response to one of my questions. Here are the links to the relevant posts:

http://permalink.gmane.org/gmane.science.biology.informatics.conductor/52714
http://permalink.gmane.org/gmane.science.biology.informatics.conductor/52752

-Ryan

ADD COMMENT • link 10.7 years ago Ryan C. Thompson ★ 7.9k

0

Yunshun Chen ▴ 900

@yunshun-chen-5451

Last seen 5 days ago

Australia

Hi Ming,

These extra genes are the genes that you miss if you treat A and B as the same. If you want to see whether they are also different between A and B, you should test that formally using the contrast A-B.

For your purpose, you can run the two tests separately and then look for the genes they have in common.

Regards,
Yunshun
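The two-test approach could look roughly like the sketch below. It is an assumption-laden illustration rather than a recommended pipeline: the counts are simulated, the three replicates per group, the FDR cutoff of 0.05 and the helper `de_genes()` are all hypothetical choices, and estimateDisp() is used for the dispersion step.

```r
library(edgeR)  # also attaches limma, which provides makeContrasts()

set.seed(1)
ngenes <- 200
group  <- factor(rep(c("A", "B", "C"), each = 3))

# Simulated counts (illustration only): genes 1-20 are raised in C,
# genes 11-30 are raised in A relative to B.
mu <- matrix(100, nrow = ngenes, ncol = length(group))
mu[1:20, group == "C"]  <- 400
mu[11:30, group == "A"] <- 400
counts <- matrix(rnbinom(length(mu), mu = mu, size = 10), nrow = ngenes,
                 dimnames = list(paste0("gene", seq_len(ngenes)),
                                 paste0("sample.", seq_along(group))))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
design <- model.matrix(~0 + group, data = y$samples)
colnames(design) <- levels(y$samples$group)
y   <- estimateDisp(y, design)  # dispersions estimated for this design
fit <- glmFit(y, design)

con <- makeContrasts(C_vs_avgAB = C - (A + B) / 2,
                     A_vs_B     = A - B,
                     levels = design)

# Hypothetical helper: names of genes below an FDR cutoff for one contrast.
de_genes <- function(fit, contrast, fdr = 0.05) {
  tab <- topTags(glmLRT(fit, contrast = contrast), n = Inf)$table
  rownames(tab)[tab$FDR < fdr]
}

de_C   <- de_genes(fit, con[, "C_vs_avgAB"])  # C vs average of A and B
de_AB  <- de_genes(fit, con[, "A_vs_B"])      # A vs B
common <- intersect(de_C, de_AB)              # significant in both tests
length(common)
```

Both contrasts are tested from the same fit, so the dispersions only need to be estimated once here; a different design matrix would require re-estimating them.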

ADD COMMENT • link 10.7 years ago Yunshun Chen ▴ 900

0

Georg Otto ▴ 120

@georg-otto-6333

Last seen 6.0 years ago

United Kingdom

Thanks a lot, Yunshun and Ryan, for your informative answers. I understand that for my purposes it is preferable to use a design matrix like this

> design
         A B C
sample.1 1 0 0
sample.2 1 0 0
sample.3 0 1 0
sample.4 0 1 0
sample.5 0 0 1

and to average in the contrast like this:

> lrt <- glmLRT(fit, contrast=c(-0.5,-0.5,1))

But what would happen if there is a strong imbalance between samples A and B, e.g.:

> design
         A B C
sample.1 1 0 0
sample.2 1 0 0
sample.3 1 0 0
sample.4 1 0 0
sample.5 1 0 0
sample.6 1 0 0
sample.7 0 1 0
sample.8 0 1 0
sample.9 0 0 1

Should I still use the above approach, or is it more advisable to put A and B in one group and test AB vs C?

> design
         A.B C
sample.1   1 0
sample.2   1 0
sample.3   1 0
sample.4   1 0
sample.5   1 0
sample.6   1 0
sample.7   1 0
sample.8   1 0
sample.9   0 1
> lrt <- glmLRT(fit, contrast=c(-1,1))

Thanks a lot and best wishes,
Georg

ADD COMMENT • link 10.7 years ago Georg Otto ▴ 120
