Question

A major gene is missing in hugene10sttranscriptcluster.db (v.8.7.0)

mengyuankan ▴ 20

@mengyuankan-17933

Hi,

I found that FKBP5, the gene I am most interested in, is missing from hugene10sttranscriptcluster.db (v8.7.0) but is present in the older version v8.2.0. The code I ran is:

> library(hugene10sttranscriptcluster.db)

> mapped_probes <- mappedkeys(hugene10sttranscriptclusterSYMBOL)
> xx <- as.list(hugene10sttranscriptclusterSYMBOL[mapped_probes])
> 'FKBP5' %in% xx

> mapped_probes <- mappedkeys(hugene10sttranscriptclusterENTREZID)
> xx <- as.list(hugene10sttranscriptclusterENTREZID[mapped_probes])
> '2289' %in% xx

> mapped_probes <- mappedkeys(hugene10sttranscriptclusterENSEMBL)
> xx <- as.list(hugene10sttranscriptclusterENSEMBL[mapped_probes])
> "ENSG00000096060" %in% xx

The outputs are all TRUE with v8.2.0 and all FALSE with v8.7.0. I haven't checked in which version the gene first went missing. FKBP5 is a major protein-coding gene, so it seems unlikely to be missing from RefSeq, GenBank, or Entrez Gene. Do you have any idea what is going on? Thanks!

Mengyuan

hugene10sttranscriptcluster.db annotation • 1.3k views

ADD COMMENT • link 6.5 years ago mengyuankan ▴ 20

score 1 · Answer 1 · 2018-10-22

James W. MacDonald 68k

@james-w-macdonald-5106

> select(hugene10sttranscriptcluster.db, "FKBP5",c("PROBEID","ENTREZID","ENSEMBL"), "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
  SYMBOL PROBEID ENTREZID         ENSEMBL
1  FKBP5 8125919     2289 ENSG00000096060
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
[1] hugene10sttranscriptcluster.db_8.7.0 org.Hs.eg.db_3.6.0                  
[3] AnnotationDbi_1.42.1                 IRanges_2.14.10                     
[5] S4Vectors_0.18.3                     Biobase_2.40.0                      
[7] BiocGenerics_0.26.0                 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18    digest_0.6.15   DBI_1.0.0       RSQLite_2.1.1  
 [5] blob_1.1.1      tools_3.5.0     bit64_0.9-7     bit_1.1-14     
 [9] compiler_3.5.0  pkgconfig_2.0.1 memoise_1.1.0

ADD COMMENT • link 6.5 years ago James W. MacDonald 68k

Also do note that mappedkeys gives you the keys (probeset IDs), not the things that are mapped. So every time you used mappedkeys, you got back just the probeset IDs, so by definition none of the things you were looking for would be in there.

> all.equal(mappedkeys(hugene10sttranscriptclusterSYMBOL), mappedkeys(hugene10sttranscriptclusterENTREZID))
[1] TRUE

You should be using select to do queries, not using the old BiMap interface. You could use the keys function however:

> grep("FKBP5", keys(hugene10sttranscriptcluster.db, "SYMBOL"), value = TRUE)
[1] "FKBP5"

or

> grep("^2289$", keys(hugene10sttranscriptcluster.db, "ENTREZID"), value = TRUE)
[1] "2289"
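
If you want a lookup that keeps every mapping for a probeset in one step, mapIds from AnnotationDbi is another option. What follows is only a sketch: the wrapper name lookup_symbols is made up for illustration, and it assumes you pass in a loaded annotation package object such as hugene10sttranscriptcluster.db.

```r
# Hypothetical helper (the name is illustrative, not part of any package).
# Returns a named list with one character vector of gene symbols per probeset ID.
lookup_symbols <- function(db, probes) {
  AnnotationDbi::mapIds(
    db,
    keys      = probes,
    column    = "SYMBOL",
    keytype   = "PROBEID",
    multiVals = "list"   # keep all symbols instead of only the first match
  )
}
```

With the annotation package loaded, a call would look like lookup_symbols(hugene10sttranscriptcluster.db, "8125919"). The default multiVals = "first" silently keeps one symbol per key, which is why the sketch asks for "list".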

ADD REPLY • link 6.5 years ago James W. MacDonald 68k

Thanks for the explanations. The select function does return the mapping, but that does not explain why probe ID 8125919 is missing from mappedkeys in v8.7.0, or why it is present in mappedkeys in v8.2.0.

"Also do note that mappedkeys gives you the keys (probeset IDs), not the things that are mapped." -- Please note that I used "as.list" to obtain a list with mapped gene symbols as values and with probeset ids as keys.

I checked 8125919 in two hugene10sttranscriptcluster.db versions:

In v8.2.0, probe id 8125919 is uniquely mapped to FKBP5:

> select(hugene10sttranscriptcluster.db, "8125919",c("PROBEID","ENTREZID","ENSEMBL","SYMBOL"), "PROBEID")
'select()' returned 1:1 mapping between keys and columns
  PROBEID ENTREZID         ENSEMBL SYMBOL
1 8125919     2289 ENSG00000096060  FKBP5

While in v8.7.0, 8125919 is mapped to two gene symbols:

> select(hugene10sttranscriptcluster.db, "8125919",c("PROBEID","ENTREZID","ENSEMBL","SYMBOL"), "PROBEID")
'select()' returned 1:many mapping between keys and columns
  PROBEID ENTREZID         ENSEMBL    SYMBOL
1 8125919     2289 ENSG00000096060     FKBP5
2 8125919   285847            <NA> LOC285847

So I guess that because the probe ID now maps to more than one gene symbol and Entrez ID, it was excluded from mappedkeys in version 8.7.0 of hugene10sttranscriptcluster.db.
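
To see how a `%in%` test against a list reacts to one-to-many mappings, here is a small base R sketch. The two lists are toy stand-ins for what as.list() returns, built from the symbols in the select() output above; they are not real package output.

```r
# Toy stand-ins for as.list(<BiMap>) results; assumptions for illustration only.
single <- list("8125919" = "FKBP5")                  # one symbol per probeset
multi  <- list("8125919" = c("FKBP5", "LOC285847"))  # two symbols for the same probeset

# %in% against a list compares the value with each list element as a whole,
# so an element holding two symbols does not match a single symbol.
in_single <- "FKBP5" %in% single
in_multi  <- "FKBP5" %in% multi

# Flattening the list first makes every symbol individually searchable.
in_flat <- "FKBP5" %in% unlist(multi)

print(c(single = in_single, multi = in_multi, flattened = in_flat))
```

So even when a multi-mapped probeset is present in the list, the membership test should be run on unlist(xx) rather than on the list itself.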

ADD REPLY • link 6.5 years ago mengyuankan ▴ 20

One reason I suggested not using the old BiMap interface is that its default is to make any multi-mapping probeset return NA, on the argument that we can't say for sure what such a probeset is measuring. With the newer database-type interface we just return all the data, including the probesets that have one-to-many mappings, and let the end user sort it out.

In addition, what you are doing still doesn't make sense - if you convert a BiMap object to a list, the names of the list are still the probeset IDs. So you are looking for the symbol and Entrez Gene ID in the set of probeset IDs, rather than in the list members. For example

> z <- as.list(hugene10sttranscriptclusterSYMBOL)
> z[1:5]
$`7892501`
[1] NA

$`7892502`
[1] NA

$`7892503`
[1] NA

$`7892504`
[1] NA

$`7892505`
[1] NA

> z["8125919"]
$`8125919`
[1] NA

If you want the multi-mapping probes, you need to use toggleProbes first.

> zz <- toggleProbes(hugene10sttranscriptclusterSYMBOL, "all")
> zzz <- as.list(zz)
> zzz["8125919"]
$`8125919`
[1] "FKBP5"     "LOC285847"

or alternatively

> grep("FKBP5", unlist(zzz), value = TRUE)
81259191
 "FKBP5"

Also do note that we are simply re-packaging information that we get from Affy. If they update their annotation file to say that a given probeset measures something completely different, then our annotation packages will reflect that. We don't do any vetting of their (or anybody's) annotation, and are simply in the business of putting those data in a format that we think is simpler for our end users to utilize.

ADD REPLY • link 6.5 years ago James W. MacDonald 68k

This explanation really helps. Thanks!

ADD REPLY • link 6.5 years ago mengyuankan ▴ 20

score 0 · Answer 2 · 2018-10-23

Probe ID 8125919 of FKBP5 is not mapped to a unique gene symbol in hugene10sttranscriptcluster.db v.8.7.0 because Affy updated the annotation file.

> select(hugene10sttranscriptcluster.db, "8125919",c("PROBEID","ENTREZID","ENSEMBL","SYMBOL"), "PROBEID")
'select()' returned 1:many mapping between keys and columns
  PROBEID ENTREZID         ENSEMBL    SYMBOL
1 8125919     2289 ENSG00000096060     FKBP5
2 8125919   285847            <NA> LOC285847

According to the answer above, the old BiMap interface returns NA for any multi-mapping probeset by default. Use the select function or the keys function to retrieve those genes.
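
For anyone who wants to find which probesets are affected, a rough base R sketch follows. The data frame is typed in by hand from the select() output above, and the helper name symbols_per_probeset is made up for illustration; in practice you would pass in the data frame returned by select() for all your probe IDs.

```r
# Rows copied by hand from the v8.7.0 select() output shown above.
res <- data.frame(
  PROBEID  = c("8125919", "8125919"),
  ENTREZID = c("2289", "285847"),
  ENSEMBL  = c("ENSG00000096060", NA),
  SYMBOL   = c("FKBP5", "LOC285847"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of distinct, non-missing symbols per probeset.
symbols_per_probeset <- function(df) {
  tapply(df$SYMBOL, df$PROBEID, function(s) length(unique(s[!is.na(s)])))
}

n_symbols <- symbols_per_probeset(res)

# Probesets that map to more than one symbol.
multi_mapped <- names(n_symbols)[n_symbols > 1]
print(multi_mapped)
```

Probesets listed in multi_mapped are the ones the BiMap interface hides by default.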