GSEA and edgeR
3
0
Entering edit mode
@iain-gallagher-2532
Last seen 9.4 years ago
United Kingdom
Dear List / Gordon I would like to carry out a GSEA analysis on some RNA-Seq data. I have analysed the data using edgeR and have a list of regulated genes. The data is paired - 6 biological samples, cells from each are infected or uninfected. Looking around there is little advice on the use of GSEA in RNA-Seq data. Using edgeR (and having a relatively small sample size) I was hoping to make use of the romer algorithm which is implemented in limma. However I am having a conceptual problem setting up my data for this. As the study uses paired sample data I have followed the guidance for this type of analysis in the marvellous edgeR manual. Searching for advice on how to conduct GSEA I came across this post (http://www .mail-archive.com/bioc-sig-sequencing at r-project.org/msg02020.html) by Gordon Smyth (author of edgeR / limma) wherein he describes a suitable strategy for edgeR and the rotational GSEA algorithms. These algorithms require the specification of a contrast (in my case between infected and uninfected animals). e.g. test <- romer(geneIndex, countData, design=design, contrast=contr[,1], nrot=9999) My design matrix looks like this: library(limma) samples <- rep(c('s1', 's2', 's3', 's4', 's5', 's6'),2) infection <- c(rep(1,6), rep(2,6)) s <- as.factor(samples) i <- as.factor(infection) design <- model.matrix(~s+i) colnames(design)[7] <- 'infection' > design (Intercept) ss2 ss3 ss4 ss5 ss6 infection 1 1 0 0 0 0 0 0 2 1 1 0 0 0 0 0 3 1 0 1 0 0 0 0 4 1 0 0 1 0 0 0 5 1 0 0 0 1 0 0 6 1 0 0 0 0 1 0 7 1 0 0 0 0 0 1 8 1 1 0 0 0 0 1 9 1 0 1 0 0 0 1 10 1 0 0 1 0 0 1 11 1 0 0 0 1 0 1 12 1 0 0 0 0 1 1 attr(,"assign") [1] 0 1 1 1 1 1 2 attr(,"contrasts") attr(,"contrasts")$s [1] "contr.treatment" attr(,"contrasts")$i [1] "contr.treatment" and in edgeR I carry out the following fit of a GLM to get the genes regulated between infected and uninfected animals i.e. fit a GLM for infection and no infection then edgeR carries out likeliehood tests to find the genes where the two models differ (I think!). #dispersion estimate d <- estimateGLMCommonDisp(d, design) #fit the NB GLM for each gene fitFiltered <- glmFit(d, design, dispersion = d$common.dispersion) #carry out the likliehood ratio test lrtFiltered <- glmLRT(d, fitFiltered, coef = 7) #how many genes are DE? summary(decideTestsDGE(lrtFiltered))#DE genes So (finally) my question is how do I set up a contrast (whilst keeping the pairing) for romer? Thanks Best Iain > sessionInfo() R version 2.13.1 (2011-07-08) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.utf8 LC_NUMERIC=C [3] LC_TIME=en_GB.utf8 LC_COLLATE=en_GB.utf8 [5] LC_MONETARY=C LC_MESSAGES=en_GB.utf8 [7] LC_PAPER=en_GB.utf8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] annotate_1.30.0 GO.db_2.5.0 goseq_1.4.0 [4] geneLenDataBase_0.99.7 BiasedUrn_1.04 org.Bt.eg.db_2.5.0 [7] RSQLite_0.9-4 DBI_0.2-5 AnnotationDbi_1.14.1 [10] Biobase_2.12.2 edgeR_2.2.5 limma_3.8.3 loaded via a namespace (and not attached): [1] biomaRt_2.8.1 Biostrings_2.20.3 BSgenome_1.20.0 [4] GenomicFeatures_1.4.4 GenomicRanges_1.4.8 grid_2.13.1 [7] IRanges_1.10.6 lattice_0.19-33 Matrix_0.9996875-3 [10] mgcv_1.7-6 nlme_3.1-102 RCurl_1.6-9 [13] rtracklayer_1.12.4 tools_2.13.1 XML_3.4-2 [16] xtable_1.5-6 >
GO edgeR GO edgeR • 4.6k views
ADD COMMENT
0
Entering edit mode
@sean-davis-490
Last seen 4 months ago
United States
Hi, Iain. Before proceeding, you might want to take a look at: http://bioconductor.org/packages/release/bioc/html/goseq.html This paper and others like it describe the problem and another solution: http://www.ncbi.nlm.nih.gov/pubmed/21252076 Sean On Thu, Sep 1, 2011 at 9:13 AM, Iain Gallagher <iaingallagher at="" btopenworld.com=""> wrote: > Dear List / Gordon > > I would like to carry out a GSEA analysis on some RNA-Seq data. I have analysed the data using edgeR and have a list of regulated genes. The data is paired - 6 biological samples, cells from each are infected or uninfected. > > Looking around there is little advice on the use of GSEA in RNA-Seq data. Using edgeR (and having a relatively small sample size) I was hoping to make use of the romer algorithm which is implemented in limma. However I am having a conceptual problem setting up my data for this. > > As the study uses paired sample data I have followed the guidance for this type of analysis in the marvellous edgeR manual. Searching for advice on how to conduct GSEA I came across this post (http://www .mail-archive.com/bioc-sig-sequencing at r-project.org/msg02020.html) by Gordon Smyth (author of edgeR / limma) wherein he describes a suitable strategy for edgeR and the rotational GSEA algorithms. > > These algorithms require the specification of a contrast (in my case between infected and uninfected animals). > > e.g. test <- romer(geneIndex, countData, design=design, contrast=contr[,1], nrot=9999) > > My design matrix looks like this: > > library(limma) > > samples <- rep(c('s1', 's2', 's3', 's4', 's5', 's6'),2) > infection <- c(rep(1,6), rep(2,6)) > > s <- as.factor(samples) > i <- as.factor(infection) > > design <- model.matrix(~s+i) > colnames(design)[7] <- 'infection' > >> design > ? (Intercept) ss2 ss3 ss4 ss5 ss6 infection > 1 ? ? ? ? ? ?1 ? 0 ? 0 ? 0 ? 0 ? 0 ? ? ? ? 0 > 2 ? ? ? ? ? ?1 ? 1 ? 0 ? 0 ? 0 ? 0 ? ? ? ? 0 > 3 ? ? ? ? ? ?1 ? 0 ? 1 ? 0 ? 0 ? 0 ? ? ? ? 0 > 4 ? ? ? ? ? ?1 ? 0 ? 0 ? 1 ? 0 ? 0 ? ? ? ? 0 > 5 ? ? ? ? ? ?1 ? 0 ? 0 ? 0 ? 1 ? 0 ? ? ? ? 0 > 6 ? ? ? ? ? ?1 ? 0 ? 0 ? 0 ? 0 ? 1 ? ? ? ? 0 > 7 ? ? ? ? ? ?1 ? 0 ? 0 ? 0 ? 0 ? 0 ? ? ? ? 1 > 8 ? ? ? ? ? ?1 ? 1 ? 0 ? 0 ? 0 ? 0 ? ? ? ? 1 > 9 ? ? ? ? ? ?1 ? 0 ? 1 ? 0 ? 0 ? 0 ? ? ? ? 1 > 10 ? ? ? ? ? 1 ? 0 ? 0 ? 1 ? 0 ? 0 ? ? ? ? 1 > 11 ? ? ? ? ? 1 ? 0 ? 0 ? 0 ? 1 ? 0 ? ? ? ? 1 > 12 ? ? ? ? ? 1 ? 0 ? 0 ? 0 ? 0 ? 1 ? ? ? ? 1 > attr(,"assign") > [1] 0 1 1 1 1 1 2 > attr(,"contrasts") > attr(,"contrasts")$s > [1] "contr.treatment" > > attr(,"contrasts")$i > [1] "contr.treatment" > > and in edgeR I carry out the following fit of a GLM to get the genes regulated between infected and uninfected animals i.e. fit a GLM for infection and no infection then edgeR carries out likeliehood tests to find the genes where the two models differ (I think!). > > #dispersion estimate > d <- estimateGLMCommonDisp(d, design) > > #fit the NB GLM for each gene > fitFiltered <- glmFit(d, design, dispersion = d$common.dispersion) > > #carry out the likliehood ratio test > lrtFiltered <- glmLRT(d, fitFiltered, coef = 7) > > #how many genes are DE? > summary(decideTestsDGE(lrtFiltered))#DE genes > > So (finally) my question is how do I set up a contrast (whilst keeping the pairing) for romer? > > Thanks > > Best > > Iain > >> sessionInfo() > R version 2.13.1 (2011-07-08) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > ?[1] LC_CTYPE=en_GB.utf8 ? ? ? LC_NUMERIC=C > ?[3] LC_TIME=en_GB.utf8 ? ? ? ?LC_COLLATE=en_GB.utf8 > ?[5] LC_MONETARY=C ? ? ? ? ? ? LC_MESSAGES=en_GB.utf8 > ?[7] LC_PAPER=en_GB.utf8 ? ? ? LC_NAME=C > ?[9] LC_ADDRESS=C ? ? ? ? ? ? ?LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base > > other attached packages: > ?[1] annotate_1.30.0 ? ? ? ?GO.db_2.5.0 ? ? ? ? ? ?goseq_1.4.0 > ?[4] geneLenDataBase_0.99.7 BiasedUrn_1.04 ? ? ? ? org.Bt.eg.db_2.5.0 > ?[7] RSQLite_0.9-4 ? ? ? ? ?DBI_0.2-5 ? ? ? ? ? ? ?AnnotationDbi_1.14.1 > [10] Biobase_2.12.2 ? ? ? ? edgeR_2.2.5 ? ? ? ? ? ?limma_3.8.3 > > loaded via a namespace (and not attached): > ?[1] biomaRt_2.8.1 ? ? ? ? Biostrings_2.20.3 ? ? BSgenome_1.20.0 > ?[4] GenomicFeatures_1.4.4 GenomicRanges_1.4.8 ? grid_2.13.1 > ?[7] IRanges_1.10.6 ? ? ? ?lattice_0.19-33 ? ? ? Matrix_0.9996875-3 > [10] mgcv_1.7-6 ? ? ? ? ? ?nlme_3.1-102 ? ? ? ? ?RCurl_1.6-9 > [13] rtracklayer_1.12.4 ? ?tools_2.13.1 ? ? ? ? ?XML_3.4-2 > [16] xtable_1.5-6 >> > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD COMMENT
0
Entering edit mode
Hi Sean Thanks for your reply. I have already carried out a goseq analysis on my data. As I understand it goseq (like GOstats) looks for enrichment in a list of genes defined a priori as regulated and flags categories those genes belong to as enriched or not in that list of regulated genes. I was more interested in testing 'sets' as a whole without defining a list of regulated genes up front. I should have mentioned in my original message that I'd used goseq already since the two techniques are related. Aplologies for that. Best Iain --- On Thu, 1/9/11, Sean Davis <sdavis2 at="" mail.nih.gov=""> wrote: > From: Sean Davis <sdavis2 at="" mail.nih.gov=""> > Subject: Re: [BioC] GSEA and edgeR > To: "Iain Gallagher" <iaingallagher at="" btopenworld.com=""> > Cc: "bioconductor" <bioconductor at="" stat.math.ethz.ch="">, smyth at wehi.edu.au > Date: Thursday, 1 September, 2011, 16:11 > Hi, Iain. > > Before proceeding, you might want to take a look at: > > http://bioconductor.org/packages/release/bioc/html/goseq.html > > This paper and others like it describe the problem and > another solution: > > http://www.ncbi.nlm.nih.gov/pubmed/21252076 > > Sean > > > On Thu, Sep 1, 2011 at 9:13 AM, Iain Gallagher > <iaingallagher at="" btopenworld.com=""> > wrote: > > Dear List / Gordon > > > > I would like to carry out a GSEA analysis on some > RNA-Seq data. I have analysed the data using edgeR and have > a list of regulated genes. The data is paired - 6 biological > samples, cells from each are infected or uninfected. > > > > Looking around there is little advice on the use of > GSEA in RNA-Seq data. Using edgeR (and having a relatively > small sample size) I was hoping to make use of the romer > algorithm which is implemented in limma. However I am having > a conceptual problem setting up my data for this. > > > > As the study uses paired sample data I have followed > the guidance for this type of analysis in the marvellous > edgeR manual. Searching for advice on how to conduct GSEA I > came across this post (http://www.mail-archive.com/bioc-sig- sequencing at r-project.org/msg02020.html) > by Gordon Smyth (author of edgeR / limma) wherein he > describes a suitable strategy for edgeR and the rotational > GSEA algorithms. > > > > These algorithms require the specification of a > contrast (in my case between infected and uninfected > animals). > > > > e.g. test <- romer(geneIndex, countData, > design=design, contrast=contr[,1], nrot=9999) > > > > My design matrix looks like this: > > > > library(limma) > > > > samples <- rep(c('s1', 's2', 's3', 's4', 's5', > 's6'),2) > > infection <- c(rep(1,6), rep(2,6)) > > > > s <- as.factor(samples) > > i <- as.factor(infection) > > > > design <- model.matrix(~s+i) > > colnames(design)[7] <- 'infection' > > > >> design > > ? (Intercept) ss2 ss3 ss4 ss5 ss6 infection > > 1 ? ? ? ? ? ?1 ? 0 ? 0 ? 0 ? 0 ? 0 ? ? ? > ? 0 > > 2 ? ? ? ? ? ?1 ? 1 ? 0 ? 0 ? 0 ? 0 ? ? ? > ? 0 > > 3 ? ? ? ? ? ?1 ? 0 ? 1 ? 0 ? 0 ? 0 ? ? ? > ? 0 > > 4 ? ? ? ? ? ?1 ? 0 ? 0 ? 1 ? 0 ? 0 ? ? ? > ? 0 > > 5 ? ? ? ? ? ?1 ? 0 ? 0 ? 0 ? 1 ? 0 ? ? ? > ? 0 > > 6 ? ? ? ? ? ?1 ? 0 ? 0 ? 0 ? 0 ? 1 ? ? ? > ? 0 > > 7 ? ? ? ? ? ?1 ? 0 ? 0 ? 0 ? 0 ? 0 ? ? ? > ? 1 > > 8 ? ? ? ? ? ?1 ? 1 ? 0 ? 0 ? 0 ? 0 ? ? ? > ? 1 > > 9 ? ? ? ? ? ?1 ? 0 ? 1 ? 0 ? 0 ? 0 ? ? ? > ? 1 > > 10 ? ? ? ? ? 1 ? 0 ? 0 ? 1 ? 0 ? 0 ? ? ? > ? 1 > > 11 ? ? ? ? ? 1 ? 0 ? 0 ? 0 ? 1 ? 0 ? ? ? > ? 1 > > 12 ? ? ? ? ? 1 ? 0 ? 0 ? 0 ? 0 ? 1 ? ? ? > ? 1 > > attr(,"assign") > > [1] 0 1 1 1 1 1 2 > > attr(,"contrasts") > > attr(,"contrasts")$s > > [1] "contr.treatment" > > > > attr(,"contrasts")$i > > [1] "contr.treatment" > > > > and in edgeR I carry out the following fit of a GLM to > get the genes regulated between infected and uninfected > animals i.e. fit a GLM for infection and no infection then > edgeR carries out likeliehood tests to find the genes where > the two models differ (I think!). > > > > #dispersion estimate > > d <- estimateGLMCommonDisp(d, design) > > > > #fit the NB GLM for each gene > > fitFiltered <- glmFit(d, design, dispersion = > d$common.dispersion) > > > > #carry out the likliehood ratio test > > lrtFiltered <- glmLRT(d, fitFiltered, coef = 7) > > > > #how many genes are DE? > > summary(decideTestsDGE(lrtFiltered))#DE genes > > > > So (finally) my question is how do I set up a contrast > (whilst keeping the pairing) for romer? > > > > Thanks > > > > Best > > > > Iain > > > >> sessionInfo() > > R version 2.13.1 (2011-07-08) > > Platform: x86_64-pc-linux-gnu (64-bit) > > > > locale: > > ?[1] LC_CTYPE=en_GB.utf8 ? ? ? LC_NUMERIC=C > > ?[3] LC_TIME=en_GB.utf8 ? ? ? > ?LC_COLLATE=en_GB.utf8 > > ?[5] LC_MONETARY=C ? ? ? ? ? ? > LC_MESSAGES=en_GB.utf8 > > ?[7] LC_PAPER=en_GB.utf8 ? ? ? LC_NAME=C > > ?[9] LC_ADDRESS=C ? ? ? ? ? ? ?LC_TELEPHONE=C > > [11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C > > > > attached base packages: > > [1] stats ? ? graphics ?grDevices utils ? ? > datasets ?methods ? base > > > > other attached packages: > > ?[1] annotate_1.30.0 ? ? ? ?GO.db_2.5.0 ? ? ? > ? ? ?goseq_1.4.0 > > ?[4] geneLenDataBase_0.99.7 BiasedUrn_1.04 ? ? ? > ? org.Bt.eg.db_2.5.0 > > ?[7] RSQLite_0.9-4 ? ? ? ? ?DBI_0.2-5 ? ? ? > ? ? ? ?AnnotationDbi_1.14.1 > > [10] Biobase_2.12.2 ? ? ? ? edgeR_2.2.5 ? ? ? > ? ? ?limma_3.8.3 > > > > loaded via a namespace (and not attached): > > ?[1] biomaRt_2.8.1 ? ? ? ? Biostrings_2.20.3 ? > ? BSgenome_1.20.0 > > ?[4] GenomicFeatures_1.4.4 GenomicRanges_1.4.8 ? > grid_2.13.1 > > ?[7] IRanges_1.10.6 ? ? ? ?lattice_0.19-33 ? ? > ? Matrix_0.9996875-3 > > [10] mgcv_1.7-6 ? ? ? ? ? ?nlme_3.1-102 ? ? ? > ? ?RCurl_1.6-9 > > [13] rtracklayer_1.12.4 ? ?tools_2.13.1 ? ? ? ? > ?XML_3.4-2 > > [16] xtable_1.5-6 > >> > > > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > >
ADD REPLY
0
Entering edit mode
Hello List As a follow-up to my original query on GSEA with edgeR I recieved the following advice from Gordon Smyth: "Of the gene set tests in limma, the only ones that can be used as part of a negative binomial based analysis of count data are wilcoxGST and geneSetTest. The others, roast and romer, make multivariate normal assumptions that are incompatible with count distributions. To make a gene set test using geneSetTest, you would do something like index <- lrtFilter$genes$ID %in% MyGeneSetSymbols p.value <- geneSetTest(index, lrtFiltered$table$LR.statistic, type="f") The approach does treat the genes as statistically independent, so gives p-values that are bit optimistic, but otherwise it is a productive approach. To use romer, you would have to make some normal approximations: effective.lib.size <- d$samples$lib.size * d$samples$norm.factors y <- log2( t(t(d$counts+0.5) / (effective.lib.size+0.5)) ) Then you can use test <- romer(geneIndex, y, design, contrast=7) Actually, you don't even need to set the contrast argument, as the last coefficient is the default. The contrast argument of romer is like a combination of the coef and contrast arguments of glmLRT. If it is of length one, it is taken to be a coef. If it's a vector, it's taken to be a contrast." I hope this advice is useful to others wishing to undertake GSEA with RNA-Seq data (it is for me!). Best Iain --- On Thu, 1/9/11, Iain Gallagher <iaingallagher at="" btopenworld.com=""> wrote: > From: Iain Gallagher <iaingallagher at="" btopenworld.com=""> > Subject: Re: [BioC] GSEA and edgeR > To: "Sean Davis" <sdavis2 at="" mail.nih.gov=""> > Cc: smyth at wehi.edu.au, "bioconductor" <bioconductor at="" stat.math.ethz.ch=""> > Date: Thursday, 1 September, 2011, 16:21 > Hi Sean > > Thanks for your reply. > > I have already carried out a goseq analysis on my data. > > As I understand it goseq (like GOstats) looks for > enrichment in a list of genes defined a priori as regulated > and flags categories those genes belong to as enriched or > not in that list of regulated genes. > > I was more interested in testing 'sets' as a whole without > defining a list of regulated genes up front. > > I should have mentioned in my original message that I'd > used goseq already since the two techniques are related. > Aplologies for that. > > Best > > Iain > > > > --- On Thu, 1/9/11, Sean Davis <sdavis2 at="" mail.nih.gov=""> > wrote: > > > From: Sean Davis <sdavis2 at="" mail.nih.gov=""> > > Subject: Re: [BioC] GSEA and edgeR > > To: "Iain Gallagher" <iaingallagher at="" btopenworld.com=""> > > Cc: "bioconductor" <bioconductor at="" stat.math.ethz.ch="">, > smyth at wehi.edu.au > > Date: Thursday, 1 September, 2011, 16:11 > > Hi, Iain. > > > > Before proceeding, you might want to take a look at: > > > > http://bioconductor.org/packages/release/bioc/html/goseq.html > > > > This paper and others like it describe the problem > and > > another solution: > > > > http://www.ncbi.nlm.nih.gov/pubmed/21252076 > > > > Sean > > > > > > On Thu, Sep 1, 2011 at 9:13 AM, Iain Gallagher > > <iaingallagher at="" btopenworld.com=""> > > wrote: > > > Dear List / Gordon > > > > > > I would like to carry out a GSEA analysis on > some > > RNA-Seq data. I have analysed the data using edgeR and > have > > a list of regulated genes. The data is paired - 6 > biological > > samples, cells from each are infected or uninfected. > > > > > > Looking around there is little advice on the use > of > > GSEA in RNA-Seq data. Using edgeR (and having a > relatively > > small sample size) I was hoping to make use of the > romer > > algorithm which is implemented in limma. However I am > having > > a conceptual problem setting up my data for this. > > > > > > As the study uses paired sample data I have > followed > > the guidance for this type of analysis in the > marvellous > > edgeR manual. Searching for advice on how to conduct > GSEA I > > came across this post (http://www.mail-archive.com/bioc-sig- sequencing at r-project.org/msg02020.html) > > by Gordon Smyth (author of edgeR / limma) wherein he > > describes a suitable strategy for edgeR and the > rotational > > GSEA algorithms. > > > > > > These algorithms require the specification of a > > contrast (in my case between infected and uninfected > > animals). > > > > > > e.g. test <- romer(geneIndex, countData, > > design=design, contrast=contr[,1], nrot=9999) > > > > > > My design matrix looks like this: > > > > > > library(limma) > > > > > > samples <- rep(c('s1', 's2', 's3', 's4', > 's5', > > 's6'),2) > > > infection <- c(rep(1,6), rep(2,6)) > > > > > > s <- as.factor(samples) > > > i <- as.factor(infection) > > > > > > design <- model.matrix(~s+i) > > > colnames(design)[7] <- 'infection' > > > > > >> design > > > ? (Intercept) ss2 ss3 ss4 ss5 ss6 infection > > > 1 ? ? ? ? ? ?1 ? 0 ? 0 ? 0 ? 0 ? 0 ? > ? ? > > ? 0 > > > 2 ? ? ? ? ? ?1 ? 1 ? 0 ? 0 ? 0 ? 0 ? > ? ? > > ? 0 > > > 3 ? ? ? ? ? ?1 ? 0 ? 1 ? 0 ? 0 ? 0 ? > ? ? > > ? 0 > > > 4 ? ? ? ? ? ?1 ? 0 ? 0 ? 1 ? 0 ? 0 ? > ? ? > > ? 0 > > > 5 ? ? ? ? ? ?1 ? 0 ? 0 ? 0 ? 1 ? 0 ? > ? ? > > ? 0 > > > 6 ? ? ? ? ? ?1 ? 0 ? 0 ? 0 ? 0 ? 1 ? > ? ? > > ? 0 > > > 7 ? ? ? ? ? ?1 ? 0 ? 0 ? 0 ? 0 ? 0 ? > ? ? > > ? 1 > > > 8 ? ? ? ? ? ?1 ? 1 ? 0 ? 0 ? 0 ? 0 ? > ? ? > > ? 1 > > > 9 ? ? ? ? ? ?1 ? 0 ? 1 ? 0 ? 0 ? 0 ? > ? ? > > ? 1 > > > 10 ? ? ? ? ? 1 ? 0 ? 0 ? 1 ? 0 ? 0 ? > ? ? > > ? 1 > > > 11 ? ? ? ? ? 1 ? 0 ? 0 ? 0 ? 1 ? 0 ? > ? ? > > ? 1 > > > 12 ? ? ? ? ? 1 ? 0 ? 0 ? 0 ? 0 ? 1 ? > ? ? > > ? 1 > > > attr(,"assign") > > > [1] 0 1 1 1 1 1 2 > > > attr(,"contrasts") > > > attr(,"contrasts")$s > > > [1] "contr.treatment" > > > > > > attr(,"contrasts")$i > > > [1] "contr.treatment" > > > > > > and in edgeR I carry out the following fit of a > GLM to > > get the genes regulated between infected and > uninfected > > animals i.e. fit a GLM for infection and no infection > then > > edgeR carries out likeliehood tests to find the genes > where > > the two models differ (I think!). > > > > > > #dispersion estimate > > > d <- estimateGLMCommonDisp(d, design) > > > > > > #fit the NB GLM for each gene > > > fitFiltered <- glmFit(d, design, dispersion = > > d$common.dispersion) > > > > > > #carry out the likliehood ratio test > > > lrtFiltered <- glmLRT(d, fitFiltered, coef = > 7) > > > > > > #how many genes are DE? > > > summary(decideTestsDGE(lrtFiltered))#DE genes > > > > > > So (finally) my question is how do I set up a > contrast > > (whilst keeping the pairing) for romer? > > > > > > Thanks > > > > > > Best > > > > > > Iain > > > > > >> sessionInfo() > > > R version 2.13.1 (2011-07-08) > > > Platform: x86_64-pc-linux-gnu (64-bit) > > > > > > locale: > > > ?[1] LC_CTYPE=en_GB.utf8 ? ? ? LC_NUMERIC=C > > > ?[3] LC_TIME=en_GB.utf8 ? ? ? > > ?LC_COLLATE=en_GB.utf8 > > > ?[5] LC_MONETARY=C ? ? ? ? ? ? > > LC_MESSAGES=en_GB.utf8 > > > ?[7] LC_PAPER=en_GB.utf8 ? ? ? LC_NAME=C > > > ?[9] LC_ADDRESS=C ? ? ? ? ? ? > ?LC_TELEPHONE=C > > > [11] LC_MEASUREMENT=en_GB.utf8 > LC_IDENTIFICATION=C > > > > > > attached base packages: > > > [1] stats ? ? graphics ?grDevices utils ? ? > > datasets ?methods ? base > > > > > > other attached packages: > > > ?[1] annotate_1.30.0 ? ? ? ?GO.db_2.5.0 ? > ? ? > > ? ? ?goseq_1.4.0 > > > ?[4] geneLenDataBase_0.99.7 BiasedUrn_1.04 ? ? > ? > > ? org.Bt.eg.db_2.5.0 > > > ?[7] RSQLite_0.9-4 ? ? ? ? ?DBI_0.2-5 ? ? > ? > > ? ? ? ?AnnotationDbi_1.14.1 > > > [10] Biobase_2.12.2 ? ? ? ? edgeR_2.2.5 ? ? > ? > > ? ? ?limma_3.8.3 > > > > > > loaded via a namespace (and not attached): > > > ?[1] biomaRt_2.8.1 ? ? ? ? Biostrings_2.20.3 > ? > > ? BSgenome_1.20.0 > > > ?[4] GenomicFeatures_1.4.4 GenomicRanges_1.4.8 > ? > > grid_2.13.1 > > > ?[7] IRanges_1.10.6 ? ? ? ?lattice_0.19-33 > ? ? > > ? Matrix_0.9996875-3 > > > [10] mgcv_1.7-6 ? ? ? ? ? ?nlme_3.1-102 ? > ? ? > > ? ?RCurl_1.6-9 > > > [13] rtracklayer_1.12.4 ? ?tools_2.13.1 ? ? > ? ? > > ?XML_3.4-2 > > > [16] xtable_1.5-6 > > >> > > > > > > > > > _______________________________________________ > > > Bioconductor mailing list > > > Bioconductor at r-project.org > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLY
0
Entering edit mode
@gordon-smyth
Last seen 12 hours ago
WEHI, Melbourne, Australia
Dear Iain, Before I attempt a fuller reply, can you tell me what gene sets you are planning to use for this analysis? Are you planning to use a large collection of sets, like the MSigDB or similar, or do you have a small number of gene sets of special relevance to your experiment? Best wishes Gordon --------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. smyth at wehi.edu.au http://www.wehi.edu.au http://www.statsci.org/smyth On Thu, 1 Sep 2011, Iain Gallagher wrote: Dear List / Gordon I would like to carry out a GSEA analysis on some RNA-Seq data. I have analysed the data using edgeR and have a list of regulated genes. The data is paired - 6 biological samples, cells from each are infected or uninfected. Looking around there is little advice on the use of GSEA in RNA-Seq data. Using edgeR (and having a relatively small sample size) I was hoping to make use of the romer algorithm which is implemented in limma. However I am having a conceptual problem setting up my data for this. As the study uses paired sample data I have followed the guidance for this type of analysis in the marvellous edgeR manual. Searching for advice on how to conduct GSEA I came across this post (http://www.mail-archive.com/bioc-sig-sequencing at r-project.org/msg02020.html) by Gordon Smyth (author of edgeR / limma) wherein he describes a suitable strategy for edgeR and the rotational GSEA algorithms. These algorithms require the specification of a contrast (in my case between infected and uninfected animals). e.g. test <- romer(geneIndex, countData, design=design, contrast=contr[,1], nrot=9999) My design matrix looks like this: library(limma) samples <- rep(c('s1', 's2', 's3', 's4', 's5', 's6'),2) infection <- c(rep(1,6), rep(2,6)) s <- as.factor(samples) i <- as.factor(infection) design <- model.matrix(~s+i) colnames(design)[7] <- 'infection' > design (Intercept) ss2 ss3 ss4 ss5 ss6 infection 1 1 0 0 0 0 0 0 2 1 1 0 0 0 0 0 3 1 0 1 0 0 0 0 4 1 0 0 1 0 0 0 5 1 0 0 0 1 0 0 6 1 0 0 0 0 1 0 7 1 0 0 0 0 0 1 8 1 1 0 0 0 0 1 9 1 0 1 0 0 0 1 10 1 0 0 1 0 0 1 11 1 0 0 0 1 0 1 12 1 0 0 0 0 1 1 attr(,"assign") [1] 0 1 1 1 1 1 2 attr(,"contrasts") attr(,"contrasts")$s [1] "contr.treatment" attr(,"contrasts")$i [1] "contr.treatment" and in edgeR I carry out the following fit of a GLM to get the genes regulated between infected and uninfected animals i.e. fit a GLM for infection and no infection then edgeR carries out likeliehood tests to find the genes where the two models differ (I think!). #dispersion estimate d <- estimateGLMCommonDisp(d, design) #fit the NB GLM for each gene fitFiltered <- glmFit(d, design, dispersion = d$common.dispersion) #carry out the likliehood ratio test lrtFiltered <- glmLRT(d, fitFiltered, coef = 7) #how many genes are DE? summary(decideTestsDGE(lrtFiltered))#DE genes So (finally) my question is how do I set up a contrast (whilst keeping the pairing) for romer? Thanks Best Iain > sessionInfo() R version 2.13.1 (2011-07-08) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.utf8 LC_NUMERIC=C [3] LC_TIME=en_GB.utf8 LC_COLLATE=en_GB.utf8 [5] LC_MONETARY=C LC_MESSAGES=en_GB.utf8 [7] LC_PAPER=en_GB.utf8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] annotate_1.30.0 GO.db_2.5.0 goseq_1.4.0 [4] geneLenDataBase_0.99.7 BiasedUrn_1.04 org.Bt.eg.db_2.5.0 [7] RSQLite_0.9-4 DBI_0.2-5 AnnotationDbi_1.14.1 [10] Biobase_2.12.2 edgeR_2.2.5 limma_3.8.3 loaded via a namespace (and not attached): [1] biomaRt_2.8.1 Biostrings_2.20.3 BSgenome_1.20.0 [4] GenomicFeatures_1.4.4 GenomicRanges_1.4.8 grid_2.13.1 [7] IRanges_1.10.6 lattice_0.19-33 Matrix_0.9996875-3 [10] mgcv_1.7-6 nlme_3.1-102 RCurl_1.6-9 [13] rtracklayer_1.12.4 tools_2.13.1 XML_3.4-2 [16] xtable_1.5-6 > > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 12 hours ago
WEHI, Melbourne, Australia
Dear Iain, The short answer is that the edgeR analysis you describe seems perfectly correct, but there is no way to use romer() as part of this analysis. Of the gene set tests in limma, the only ones that can be used as part of a negative binomial based analysis of count data are wilcoxGST and geneSetTest. The others, roast and romer, make multivariate normal assumptions that are incompatible with count distributions. To make a gene set test using geneSetTest, you would do something like index <- lrtFilter$genes$ID %in% MyGeneSetSymbols p.value <- geneSetTest(index, lrtFiltered$table$LR.statistic, type="f") The approach does treat the genes as statistically independent, so gives p-values that are bit optimistic, but otherwise it is a productive approach. To use romer, you would have to make some normal approximations: effective.lib.size <- d$samples$lib.size * d$samples$norm.factors y <- log2( t(t(d$counts+0.5) / (effective.lib.size+0.5)) ) Then you can use test <- romer(geneIndex, y, design, contrast=7) Actually, you don't even need to set the contrast argument, as the last coefficient is the default. The contrast argument of romer is like a combination of the coef and contrast arguments of glmLRT. If it is of length one, it is taken to be a coef. If it's a vector, it's taken to be a contrast. Best wishes Gordon -------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. smyth at wehi.edu.au http://www.wehi.edu.au http://www.statsci.org/smyth On Thu, 1 Sep 2011, Iain Gallagher wrote: Dear List / Gordon I would like to carry out a GSEA analysis on some RNA-Seq data. I have analysed the data using edgeR and have a list of regulated genes. The data is paired - 6 biological samples, cells from each are infected or uninfected. Looking around there is little advice on the use of GSEA in RNA-Seq data. Using edgeR (and having a relatively small sample size) I was hoping to make use of the romer algorithm which is implemented in limma. However I am having a conceptual problem setting up my data for this. As the study uses paired sample data I have followed the guidance for this type of analysis in the marvellous edgeR manual. Searching for advice on how to conduct GSEA I came across this post (http://www.mail-archive.com/bioc-sig-sequencing at r-project.org/msg02020.html) by Gordon Smyth (author of edgeR / limma) wherein he describes a suitable strategy for edgeR and the rotational GSEA algorithms. These algorithms require the specification of a contrast (in my case between infected and uninfected animals). e.g. test <- romer(geneIndex, countData, design=design, contrast=contr[,1], nrot=9999) My design matrix looks like this: library(limma) samples <- rep(c('s1', 's2', 's3', 's4', 's5', 's6'),2) infection <- c(rep(1,6), rep(2,6)) s <- as.factor(samples) i <- as.factor(infection) design <- model.matrix(~s+i) colnames(design)[7] <- 'infection' > design (Intercept) ss2 ss3 ss4 ss5 ss6 infection 1 1 0 0 0 0 0 0 2 1 1 0 0 0 0 0 3 1 0 1 0 0 0 0 4 1 0 0 1 0 0 0 5 1 0 0 0 1 0 0 6 1 0 0 0 0 1 0 7 1 0 0 0 0 0 1 8 1 1 0 0 0 0 1 9 1 0 1 0 0 0 1 10 1 0 0 1 0 0 1 11 1 0 0 0 1 0 1 12 1 0 0 0 0 1 1 attr(,"assign") [1] 0 1 1 1 1 1 2 attr(,"contrasts") attr(,"contrasts")$s [1] "contr.treatment" attr(,"contrasts")$i [1] "contr.treatment" and in edgeR I carry out the following fit of a GLM to get the genes regulated between infected and uninfected animals i.e. fit a GLM for infection and no infection then edgeR carries out likeliehood tests to find the genes where the two models differ (I think!). #dispersion estimate d <- estimateGLMCommonDisp(d, design) #fit the NB GLM for each gene fitFiltered <- glmFit(d, design, dispersion = d$common.dispersion) #carry out the likliehood ratio test lrtFiltered <- glmLRT(d, fitFiltered, coef = 7) #how many genes are DE? summary(decideTestsDGE(lrtFiltered))#DE genes So (finally) my question is how do I set up a contrast (whilst keeping the pairing) for romer? Thanks Best Iain > sessionInfo() R version 2.13.1 (2011-07-08) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.utf8 LC_NUMERIC=C [3] LC_TIME=en_GB.utf8 LC_COLLATE=en_GB.utf8 [5] LC_MONETARY=C LC_MESSAGES=en_GB.utf8 [7] LC_PAPER=en_GB.utf8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] annotate_1.30.0 GO.db_2.5.0 goseq_1.4.0 [4] geneLenDataBase_0.99.7 BiasedUrn_1.04 org.Bt.eg.db_2.5.0 [7] RSQLite_0.9-4 DBI_0.2-5 AnnotationDbi_1.14.1 [10] Biobase_2.12.2 edgeR_2.2.5 limma_3.8.3 loaded via a namespace (and not attached): [1] biomaRt_2.8.1 Biostrings_2.20.3 BSgenome_1.20.0 [4] GenomicFeatures_1.4.4 GenomicRanges_1.4.8 grid_2.13.1 [7] IRanges_1.10.6 lattice_0.19-33 Matrix_0.9996875-3 [10] mgcv_1.7-6 nlme_3.1-102 RCurl_1.6-9 [13] rtracklayer_1.12.4 tools_2.13.1 XML_3.4-2 [16] xtable_1.5-6 > > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
ADD COMMENT

Login before adding your answer.

Traffic: 471 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6