Hi.
This is a similiar issue reported here : annotating microarray data with mogene10stv1
So I'm annotating a microarray .CEL data pulled from GEO by an accession number using GEOquery, and am unable to annotate the resultant expression set using BiomaRt (using the respective hg_u133a_2.db). The error is that I receive all NAs.
The first issue is that I'm unable to use affycoretools to filter the main probes.
qus<-getGEO("GSE30550")[[1]]
getMainProbes(qus)
Error: The file GPL9188 does not appear to have been processed using the oligo package.
In this case the argument to this function should be the name of the correct pd.info package (e.g., pd.hugene.1.0.st.v1.
traceback()
2: stop(paste("The file", pdinfo, "does not appear to have been processed using",
"the oligo package.\nIn this case the argument to this function should",
"be the name of the correct pd.info package (e.g., pd.hugene.1.0.st.v1.\n"),
call. = FALSE)
1: getMainProbes(qus)
for an attempt with biomaRt
here the affyids are the featureNames(qus)
maRt<-getBM(attributes=c('affy_hg_u133a_2','entrezgene'),filters='affy_hg_u133a_2',values=affyids,mart=ensembl)
maRt[affyids %in% maRt[,1],]
affy_hg_u133a entrezgene
NA <NA> NA
NA.1 <NA> NA
NA.2 <NA> NA
NA.3 <NA> NA
NA.4 <NA> NA
NA.5 <NA> NA
NA.6 <NA> NA
NA.7 <NA> NA
NA.8 <NA> NA
NA.9 <NA> NA
NA.10 <NA> NA
NA.11 <NA> NA
NA.12 <NA> NA
NA.13 <NA> NA
NA.14 <NA> NA
NA.15 <NA> NA
NA.16 <NA> NA
NA.17 <NA> NA
NA.18 <NA> NA
NA.19 <NA> NA
NA.20 <NA> NA
so i'm getting a bunch of control probes returned without any gene names. and am not able to use the affycoretools
qus
ExpressionSet (storageMode: lockedEnvironment)
assayData: 11961 features, 268 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM757898 GSM757899 ... GSM758165 (268 total)
varLabels: title geo_accession ... data_row_count (35 total)
varMetadata: labelDescription
featureData
featureNames: 10000_at 10001_at ... 9_at (11961 total)
fvarLabels: ID Gene_ID Description
fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL9188
>
I've checked that the annotation of GPL9188 is corresponding to hgu133a version 2 chip. Yet I'm having trouble annotating the correct probes.
The alternate solution is to build the expression set using the affy package, and oligio package, however I was hoping the GEOquery package could build it faster. any suggestions greatly appreciated.
here's my info
sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] tools stats4 parallel stats graphics grDevices utils
[8] datasets methods base
other attached packages:
[1] GEOquery_2.36.0 biomaRt_2.26.1
[3] artemisData_0.21 matrixStats_0.50.1
[5] artemis_0.42.5 beeswarm_0.2.1
[7] SummarizedExperiment_1.0.2 TxDbLite_1.9.100
[9] stringdist_0.9.4.1 knitr_1.12
[11] rtracklayer_1.30.1 ensembldb_1.2.1
[13] GenomicFeatures_1.22.8 AnnotationDbi_1.32.3
[15] GenomicRanges_1.22.3 RSQLite_1.0.0
[17] DBI_0.3.1 Biobase_2.30.0
[19] qusage_2.2.0 limma_3.26.5
[21] GenomeInfoDb_1.6.1 IRanges_2.4.6
[23] S4Vectors_0.8.7 BiocGenerics_0.16.1
[25] jsonlite_0.9.19
loaded via a namespace (and not attached):
[1] TH.data_1.0-6 colorspace_1.2-6
[3] hwriter_1.3.2 estimability_1.1-1
[5] qvalue_2.2.2 futile.logger_1.4.1
[7] XVector_0.10.0 interactiveDisplayBase_1.8.0
[9] mvtnorm_1.0-4 codetools_0.2-14
[11] splines_3.2.3 R.methodsS3_1.7.0
[13] DESeq_1.22.1 geneplotter_1.48.0
[15] pathview_1.10.1 Rsamtools_1.22.0
[17] annotate_1.48.0 png_0.1-7
[19] R.oo_1.19.0 graph_1.48.0
[21] shiny_0.13.0 httr_1.0.0
[23] lsmeans_2.21-1 Matrix_1.2-3
[25] htmltools_0.3 coda_0.18-1
[27] gtable_0.1.2 reshape2_1.4.1
[29] ShortRead_1.28.0 Rcpp_0.12.3
[31] Biostrings_2.38.3 gdata_2.17.0
[33] nlme_3.1-123 stringr_1.0.0
[35] mime_0.4 gtools_3.5.0
[37] XML_3.98-1.3 erccdashboard_1.4.0
[39] AnnotationHub_2.2.3 edgeR_3.12.0
[41] zlibbioc_1.16.0 MASS_7.3-45
[43] zoo_1.7-12 scales_0.3.0
[45] aroma.light_3.0.0 BiocInstaller_1.20.1
[47] RBGL_1.46.0 KEGGgraph_1.28.0
[49] sandwich_2.3-4 rhdf5_2.14.0
[51] QuasiSeq_1.0-8 lambda.r_1.1.7
I just remembered that annoateEset is in the devel version of affycoretools, so unless you want to use devel tools, you will have to use select() or mapIds() to generate a data.frame of annotations that you can then use.
If I use the select() method the symbols are NA
test<-select(hgu133a2.db,affyids,"SYMBOL","PROBEID")
head(test)
PROBEID SYMBOL
1 10000_at <NA>
2 10001_at <NA>
3 10002_at <NA>
4 10003_at <NA>
5 10004_at <NA>
6 10005_at <NA>
>
the amount of NA is so many.
I solved it using this
library(affy)
library(oligo)
celFiles<- list.celfiles(getwd(),listGzipped=TRUE)
affyRaw<-read.celfiles(celFiles)
#library(pd.hg.u133a.2)
eset<-rma(affyRaw)
library(hgu133a2.db)
Annot<-data.frame(ACCNUM=sapply(contents(hgu133a2ACCNUM),
paste,
collapse=", "),
SYMBOL=sapply(contents(hgu133a2SYMBOL),paste,
collapse=", "))
indx<-featureNames(eset) %in% rownames(Annot)
qus<-eset[!duplicated(Annot[indx,2]),]
featureNames(qus)<-Annot[!duplicated(Annot[indx,2]),2]
qus<-qus[featureNames(qus[which(featureNames(qus)!="NA"),]),]
there was issues with rownames not being unique. not sure if this is a good idea to squash out non-unique genenames' probe ID.
open to suggestions
The GEO entry you are analyzing is based on the NCBI re-mapped CDF, where they re-mapped the data to Entrez Gene. So all of the probesets are of the form <entrezgeneid>_at.