I would do much of this a different way. Sean Davis might chime in with some information about why one might want to use a GPL object, but I can't personally remember ever thinking that I should do that. Instead I always use the GSE object (or use getGEOSuppFiles
to get the underlying CEL files, which in this case would be what I would do, as these data are MAS5.0 and I would instead use RMA). But anyway
> eset <- getGEO("GSE2457")[[1]]
Found 1 file(s)
GSE2457_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE2nnn/GSE2457/matrix/GSE2457_series_matrix.txt.gz'
Content type 'application/x-gzip' length 394616 bytes (385 KB)
downloaded 385 KB
> eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 15923 features, 10 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM46500 GSM46503 ... GSM46522 (10 total)
varLabels: title geo_accession ... data_row_count (31 total)
varMetadata: labelDescription
featureData
featureNames: 1367452_at 1367453_at ... AFFX_ratb2/X14115_at (15923
total)
fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
pubMedIds: 16118269
Annotation: GPL341
You should note that this experiment was based on penile cavernosa tissue from rats and was run on RAE230A arrays, so there is no reason to be using an HG-U133_Plus2 annotation package. But you might want to re-annotate, because in general what you get from GEO for Affy arrays is just a parsed version of their annotation data, which are not as useful for this purpose as one might think. Being the author of the package, I would tend to use affycoretools
, but there are obviously other things you could do.
> library(affycoretools)
> library(rae230a.db)
> eset <- annotateEset(eset, rae230a.db)
> head(fData(eset))
PROBEID ENTREZID SYMBOL GENENAME
1367452_at 1367452_at 690244 Sumo2 small ubiquitin-like modifier 2
1367453_at 1367453_at 114562 Cdc37 cell division cycle 37, HSP90 cochaperone
1367454_at 1367454_at 60384 Copb2 COPI coat complex subunit beta 2
1367455_at 1367455_at 116643 Vcp valosin-containing protein
1367456_at 1367456_at 81920 Ube2d3 ubiquitin-conjugating enzyme E2D 3
1367457_at 1367457_at 114558 Becn1 beclin 1
And now if you want to average the duplicated probes (where by 'duplicated' I mean those with the same NCBI Gene ID, which IMO isn't necessarily duplicated).
> z <- avereps(eset, ID = fData(eset)$ENTREZID)
Warning messages:
1: In rowsum.default(x, ID, reorder = FALSE, na.rm = TRUE) :
missing values for 'group'
2: In rowsum.default(1L - is.na(x), ID, reorder = FALSE) :
missing values for 'group'
## that warning is just telling us that there are NA values for some of the NCBI Gene IDs, so we can silence it
## by subsetting (which we should do because we don't want the average of the NA probes).
> eset2 <- eset[!is.na(fData(eset)$ENTREZID),]
> zz <- avereps(eset2, fData(eset2)$ENTREZID)
And limma is happy to use the matrix zz
, or you could stick back into an ExpressionSet
if you like. Do note that the row.names
of the matrix are now NCBI Gene IDs, so you can use that information to annotate things downstream.
seems to be cross-posted? - https://www.biostars.org/p/486578/