Hi all.
After a lengthy struggle (I am a beginning student), I was able extract a list of genes from one GSE which contained 25 GSMs. From this GSE, I extracted a list of gene symbols after which I removed all of the duplicates. So, I am now assuming that the gene symbols I obtained are from all of the GSMs within that GSE? If so, I am attempting to perform the same operation on six additional GSEs before removing duplicates and using the RobustRankAggreg package, aggregating all of the genes into one data object (vector?) before performing a differential expression analysis with the limma package to isolate a smaller set of DEGs that I could use with Cytoscape via RCy3. Unfortunately, I don't know how to retrieve genes from multiple GSE objects or how one would combine all of the gene symbols. Can limma accept one monolithic data structure of gene symbols and could a boxplot be constructed from it even though there would presumably need to be two categories? Please accept my apologies if any of the above text is incorrect. I am working on my own without any form of guidance from my PI, TA, or classmates. I am, however, finding the limma course from DataCamp to be quite useful.
Code should be placed in three backticks as shown below
library(GEOquery)
library(limma)
gse <- getGEO("GSE12657", AnnotGPL = TRUE)
features <- fData(gse[[1]])
View(features)
length(features$`Gene symbol`)
# Remove duplicates leaving 9156 unique genes
features <- features[!duplicated(features$`Gene symbol`),]
# include your problematic code here with any corresponding output
# please also include the results of running the following in an R session
sessionInfo( )
Hi James.
Very nice. Thank you. I was just looking at an email from that postdoc I consulted several weeks ago and this is what she said. Could you please explain as I am quite confused, working under an encroaching deadline, and without benefit the classmate or PI:
pData(gse) returns to the data frame of sample information only. A bit more idea of how gse ExpressionSet object looks like...you could imagine it is a multi-tab excel table. For example, when we extract the "sampledata <- pData(gse[[1]])", it is one of the tabs containing the pData data frame. For the downstream, you should look for "exprs(gse)" for the expression data, in which the rows are the genes and the columns are the sample IDs. Just do the differential expression analysis using the limma package instead of DESeq2. Now, you have the gse object which contains the assayData (expression data), phenoData (sample information), and featureData (gene annotation data).
I think the first step involves calling log2 on the ExpressionSet object? The second step is puzzling, since upon using the following code: sampleData <- pData(gse[[1]]), I don't know what column I should use as the "group" nor do I know what that means (GSE12657) although limma apparently depends upon the specification of one column name? Could you please explain 3? The postdoc appears to believe this step is obvious but it most certainly is not.
From the paper I am seeking to reproduce, the authors mentioned "hub genes" and the cytohubba package from Cytoscape. Would I need to transform the data at some point with the WGCNA package?
I spent the past several hours reading the following document and my code, thus far, is below.
https://github.com/Lindseynicer/How-to-analyze-GEO-microarray-data
Update:
Thank you James for your help. I greatly appreciate it.
We have ventured far past the main purpose of this support site, which is to help people with specific questions about the software. While I sympathize with the fact that you are confused and have little time, you already have examples (the limma User's Guide, the GitHub repo you link to) that show you pretty much exactly what you should be doing. You just need to apply that information to your situation.