Question

So many unmapped probes

Ed Siefker ▴ 230

@ed-siefker-5136

United States

I have microarray data downloaded from ArrayExpress. The annotation is listed as "pd.hugene.1.0.st.v1". I'm trying to annotate it with hugene10sttranscriptcluster.db. My problem is that a large number of probes map to no symbol or RefSeq ID. Is this normal?

```

> mydata.rma
ExpressionSet (storageMode: lockedEnvironment)
assayData: 33297 features, 14 samples
element names: exprs
protocolData
rowNames: GSM946485_48-2.CEL GSM946484_48-1.CEL ... GSM946472_0-1.CEL
    (14 total)
varLabels: exprs dates
varMetadata: labelDescription channel
phenoData
rowNames: GSM946485_48-2.CEL GSM946484_48-1.CEL ... GSM946472_0-1.CEL
    (14 total)
varLabels: Source.Name Comment..Sample_source_name. ...
    FactorValue..TIME. (27 total)
varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation: pd.hugene.1.0.st.v1
> ids<-rownames(exprs(mydata.rma))
> length(ids)
[1] 33297
> symbols <- AnnotationDbi::mapIds(hugene10sttranscriptcluster.db, ids, "SYMBOL", "PROBEID")
'select()' returned 1:many mapping between keys and columns
> sum(is.na(symbols))
[1] 11147
>
```
A third of these probes match no symbol. That seems really high. What's going on? Is this the wrong .db package to use for this platform?

written 6.9 years ago by Ed Siefker

score 2 · Accepted Answer · 2018-03-07

Your counting is off, primarily because you are naively assuming that everything on an Affy array actually measures something that should have a gene symbol.

```
> con <- db(pd.hugene.1.0.st.v1)
> ids <- dbGetQuery(con, "select transcript_cluster_id from featureSet where type='1';")[,1]
> length(ids)
[1] 253002
> sum(is.na(mapIds(hugene10sttranscriptcluster.db, as.character(ids), "ENTREZID","PROBEID")))
'select()' returned 1:many mapping between keys and columns
[1] 7567
```

Which means that about 4.4% of the 'main' probes have no Entrez Gene ID, and are probably some speculative content, or lincRNAs or whatnot.
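
To see where the unmapped IDs in the original `ExpressionSet` actually fall, one option is to cross-tabulate them against the `type` column of `featureSet`. The following is only a rough sketch: `unmapped_by_type` is a hypothetical helper name (not part of any package), and it assumes `mydata.rma` from the question is loaded and that each transcript cluster ID carries a single `type` code. If an ID has more than one, only the first match is used.

```r
# Hypothetical helper (not from any package): tabulate mapped vs. unmapped
# IDs by the featureSet 'type' code. IDs absent from 'fs' land in an NA row.
unmapped_by_type <- function(ids, symbols, fs) {
  # match() keeps only the first type found for each transcript cluster ID
  type_of <- fs$type[match(ids, as.character(fs$transcript_cluster_id))]
  table(type = type_of, unmapped = is.na(symbols), useNA = "ifany")
}

# Usage with the objects from the question; skipped if mydata.rma is absent.
if (exists("mydata.rma")) {
  library(oligo)
  library(pd.hugene.1.0.st.v1)
  library(hugene10sttranscriptcluster.db)

  con <- db(pd.hugene.1.0.st.v1)
  # featureSet has one row per probeset, so ask for distinct ID/type pairs
  fs <- DBI::dbGetQuery(
    con, "select distinct transcript_cluster_id, type from featureSet;")

  ids <- rownames(exprs(mydata.rma))
  symbols <- AnnotationDbi::mapIds(hugene10sttranscriptcluster.db,
                                   ids, "SYMBOL", "PROBEID")
  print(unmapped_by_type(ids, symbols, fs))
}
```

Rows of the resulting table are `type` codes (the query above treats `1` as the 'main' probes), and the `TRUE` column counts IDs with no symbol, so it should show directly how much of the missing annotation sits outside the main content.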