Hello,
I am trying to make a volcano plot, but of Log Fold Change vs FDR adjusted p-values. So far, I've managed to make a decent plot, but I'm having trouble figuring out how to label the genes that are differentially expressed, wether as the top, say 30, genes, or all of the genes above a certain threshold, and identify them on the graph. My progress so far goes thusly (Also, if anyone knows how to get limma to plot a graph with these axis instead of log fold change vs log odds, that would be greatly appreciated as well):
source ("http://bioconductor.org/biocLite.R")#f o r #downloading biocLite biocLite()#for installing bioconductor packages biocLite("BiocUpgrade")#use latest version biocLite("limma")#linear fits biocLite("GEOquery")#downloading data sets biocLite("statmod")# biocLite("affyPLM")#normalizing ExpressionSets
#begin loading libraries library("limma") library("GEOquery") library("statmod") library("affyPLM") #libraries loaded
gse44563 <- getGEO("GSE44563", GSEMatrix = TRUE) #downloading series record #44563 with any attached data eset44563 = gse44563[[1]]#extract attached eSet (data)
y <- log2(exprs(eset44563)) design <- model.matrix(~source_name_ch1,data=pData(eset44563)) preFit <- lmFit(y, design)
eset <- normalize.ExpressionSet.quantiles(eset44563, transfn="none" ) #quantile normalization of eset44563 #8 y <- log2(exprs(eset))#log2 transformation of eset design <- model.matrix(~source_name_ch1,data=pData(eset44563))#making #design matrix #describes what array looked like #allows for a linear fit , using matrix algebra fit <- lmFit(y, design)#linear fit
plotMA(fit)
IsExpressed <- fit$Amean>3#is mean expression #of each gene above 3? efit <- eBayes(fit[IsExpressed,], trend=TRUE, robust=TRUE)#Empi r ical #Bayes adjustment
plotMA(efit)
write.csv((topTable(efit, adjust = "fdr", n = 54675)), "possible5.csv") #write table of genes, #sorted according to False Discovery Rate adjusted #p-value to file
tableTop <- topTable(efit, adjust = "fdr", n = 42471) plot(tableTop$logFC, -log10(tableTop$adj.P.Val), xlim = c(-1, 2), ylim = c(0, 6), xlab = "log2 fold change", ylab = "FDR adjusted p-value")
My issue right now is trying to figure out how to label the points that are meaningfully foldchanged, for now, without using limma, since it appears that limma only uses logodds as its measure of statistical significance.
Any help, or even just pointers to pre-existing explanations would be immensly appreciated. Thank you
Also, here's my SessionInfo():
R version 3.1.1 (2014-07-10) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] splines parallel stats graphics [5] grDevices utils datasets methods [9] base other attached packages: [1] BiocInstaller_1.16.0 affyPLM_1.40.1 [3] preprocessCore_1.26.1 gcrma_2.36.0 [5] statmod_1.4.20 GEOquery_2.30.1 [7] limma_3.20.9 affy_1.42.3 [9] Biobase_2.24.0 BiocGenerics_0.10.0 loaded via a namespace (and not attached): [1] affyio_1.32.0 Biostrings_2.32.1 [3] IRanges_1.22.10 RCurl_1.95-4.3 [5] stats4_3.1.1 tools_3.1.1 [7] XML_3.98-1.1 XVector_0.4.0 [9] zlibbioc_1.10.0