beadchip annotation data missing gene "names"
@smt8n-9982

Dear all,

 

I built an annotation table for BeadChip data as follows (using the gene symbol as an example):

if (version == 3) {
    library(illuminaHumanv3.db)  # Use the appropriate annotation DB for the expression arrays
    kegg <- illuminaHumanv3PATH
    geneName <- illuminaHumanv3GENENAME
    hugo <- illuminaHumanv3SYMBOL
    pq <- illuminaHumanv3PROBEQUALITY
}

<skipped code>

## grabs HUGO symbol for the gene
    mapped <- mappedkeys(hugo)
    hugo2 <- as.list(hugo[mapped])
    hugoDF <- data.frame(PROBE_ID = as.character(names(hugo2)), SYMBOL = as.character(hugo2))

I did the same for the other entries and then merged them like this:

    expressionAnnotationData <- merge(keggDF, geneNameDF, by = "PROBE_ID", all = TRUE)
    expressionAnnotationData <- merge(expressionAnnotationData, hugoDF, by = "PROBE_ID", all = TRUE)
    expressionAnnotationData <- merge(expressionAnnotationData, pqDF, by = "PROBE_ID", all = TRUE)

 

However, the resulting table has a lot of probe IDs that are not mapped to anything. Initially I thought these were probes without proper matches, but many of the unannotated rows are "Perfect" quality probes:

        PROBE_ID keggID keggName geneNameID SYMBOL     QUALITY
31545 ILMN_1670545   <NA>     <NA>       <NA>   <NA>     Perfect
31546 ILMN_1670589   <NA>     <NA>       <NA>   <NA>         Bad
31547 ILMN_1670625   <NA>     <NA>       <NA>   <NA>     Perfect
31548 ILMN_1670641   <NA>     <NA>       <NA>   <NA>    No match
31549 ILMN_1670666   <NA>     <NA>       <NA>   <NA>         Bad
31550 ILMN_1670668   <NA>     <NA>       <NA>   <NA>     Perfect
31551 ILMN_1670674   <NA>     <NA>       <NA>   <NA>         Bad
31552 ILMN_1670715   <NA>     <NA>       <NA>   <NA>     Perfect
31553 ILMN_1670716   <NA>     <NA>       <NA>   <NA>         Bad
31554 ILMN_1670737   <NA>     <NA>       <NA>   <NA>     Perfect
31555 ILMN_1670757   <NA>     <NA>       <NA>   <NA>         Bad
31556 ILMN_1670800   <NA>     <NA>       <NA>   <NA>         Bad
31557 ILMN_1670805   <NA>     <NA>       <NA>   <NA>    Good****

Am I doing something wrong when extracting the annotation? If not, is this normal, and what should I do?

 

Thank you

Slava

@james-w-macdonald-5106

You will probably get better results if you use the more modern methods of extracting annotation (keys() and mapIds()). The BiMap interface is pretty old school, and it doesn't handle one-to-many mappings the way one would normally like. But as you say, there are lots of probes that are not annotated.
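
To make the one-to-many point concrete, here is a minimal base-R sketch. The probe IDs and symbols are made up for illustration (they are not taken from any annotation package); the list just mimics the shape you get from as.list() on a BiMap, with one probe mapping to two symbols.

```r
# Hypothetical probe-to-symbol list, shaped like as.list() on a BiMap.
# The second probe maps to two symbols (a one-to-many mapping).
toy <- list(ILMN_A = "GENE1", ILMN_B = c("GENE2", "GENE3"))

# as.character() on a list deparses any element longer than one,
# so the second entry becomes the literal string 'c("GENE2", "GENE3")'.
collapsed <- as.character(toy)

# Two explicit alternatives: keep only the first symbol per probe ...
first_only <- vapply(toy, function(x) x[1], character(1))

# ... or expand to one row per probe/symbol pair.
long <- data.frame(PROBE_ID = rep(names(toy), lengths(toy)),
                   SYMBOL = unlist(toy, use.names = FALSE),
                   stringsAsFactors = FALSE)
```

With mapIds() the choice is made explicitly through its multiVals argument, which defaults to returning the first match for each key.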

Edit: forgot to add where the 'ks' vector came from

> library(illuminaHumanv4.db)
> ks <- keys(illuminaHumanv4.db)
> toget <- c("PATH","GENENAME","SYMBOL","ENTREZID","ENSEMBL", "ACCNUM")
> df <- as.data.frame(lapply(toget, function(x) mapIds(illuminaHumanv4.db, ks, x, "PROBEID")))
> names(df) <- c("KEGG","GENENAME","SYMBOL","ENTREZID","ENSEMBL", "ACCNUM")

## probequality isn't a column
> z <- toTable(illuminaHumanv4PROBEQUALITY)
> df$PROBEQUALITY <- z[match(row.names(df), z[,1]),2]
> head(df)
             KEGG GENENAME SYMBOL ENTREZID ENSEMBL ACCNUM PROBEQUALITY
ILMN_1343048 <NA>     <NA>   <NA>     <NA>    <NA>   <NA>     No match
ILMN_1343049 <NA>     <NA>   <NA>     <NA>    <NA>   <NA>     No match
ILMN_1343050 <NA>     <NA>   <NA>     <NA>    <NA>   <NA>     No match
ILMN_1343052 <NA>     <NA>   <NA>     <NA>    <NA>   <NA>     No match
ILMN_1343059 <NA>     <NA>   <NA>     <NA>    <NA>   <NA>     No match
ILMN_1343061 <NA>     <NA>   <NA>     <NA>    <NA>   <NA>     No match
> df.justmatches <- df[!df$PROBEQUALITY %in% "No match",]
> head(df.justmatches)
              KEGG                                           GENENAME  SYMBOL
ILMN_1343291 03013 eukaryotic translation elongation factor 1 alpha 1  EEF1A1
ILMN_1343295 00010           glyceraldehyde-3-phosphate dehydrogenase   GAPDH
ILMN_1651199  <NA>                                               <NA>    <NA>
ILMN_1651209  <NA>                solute carrier family 35, member E2 SLC35E2
ILMN_1651210  <NA>                                               <NA>    <NA>
ILMN_1651221  <NA>                   EF-hand calcium binding domain 1  EFCAB1
             ENTREZID         ENSEMBL   ACCNUM PROBEQUALITY
ILMN_1343291     1915 ENSG00000156508 AAA52343      Perfect
ILMN_1343295     2597 ENSG00000111640 AAA52496      Perfect
ILMN_1651199     <NA>            <NA>     <NA>          Bad
ILMN_1651209     9906 ENSG00000215790 AAI01667      Perfect
ILMN_1651210     <NA>            <NA>     <NA>          Bad
ILMN_1651221    79645 ENSG00000034239 AAH25676          Bad
> apply(df, 2, function(x) sum(is.na(x)))
        KEGG     GENENAME       SYMBOL     ENTREZID      ENSEMBL       ACCNUM
       37638        12221        12221        12221        13756        12221
PROBEQUALITY
           0
> apply(df.justmatches, 2, function(x) sum(is.na(x)))
        KEGG     GENENAME       SYMBOL     ENTREZID      ENSEMBL       ACCNUM
       36391        10987        10987        10987        12517        10987
PROBEQUALITY
           0
> apply(df.justmatches[!df.justmatches$PROBEQUALITY %in% "Bad",], 2, function(x) sum(is.na(x)))
        KEGG     GENENAME       SYMBOL     ENTREZID      ENSEMBL       ACCNUM
       25574         4242         4242         4242         5316         4242
PROBEQUALITY
           0 

So most of the probes that are neither 'No match' nor 'Bad' have annotations. But it does look like there is a lot of speculative content on this array.
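
The same breakdown can be had in one step by cross-tabulating probe quality against missing annotation. This is only a sketch on a toy data frame with invented values; toy_df stands in for a data frame like the one built above, and the column names are assumed to match.

```r
# Toy stand-in for the annotated data frame; all values are invented.
toy_df <- data.frame(
  SYMBOL = c("EEF1A1", NA, NA, "GAPDH", NA, "EFCAB1"),
  PROBEQUALITY = c("Perfect", "Bad", "No match", "Perfect", "Perfect", "Bad"),
  stringsAsFactors = FALSE
)

# Rows: probe quality grade; columns: whether SYMBOL is missing.
tab <- table(quality = toy_df$PROBEQUALITY,
             missing_symbol = is.na(toy_df$SYMBOL))

# Proportion of unannotated probes within each quality grade.
prop_missing <- tapply(is.na(toy_df$SYMBOL), toy_df$PROBEQUALITY, mean)
```

On the real data, a table like this would show directly which quality grades account for most of the missing symbols, without having to subset the data frame repeatedly.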

Note that I edited my answer to include where the 'ks' object came from.

Thank you very much!
