Question

beadchip annotation data missing gene "names"

smt8n • 0

Dear all,

I built an annotation table for BeadChip data as follows (using the symbol as an example):

    if (version == 3) {
        library(illuminaHumanv3.db) # Use the appropriate annotation DB for expression arrays
        kegg <- illuminaHumanv3PATH
        geneName <- illuminaHumanv3GENENAME
        hugo <- illuminaHumanv3SYMBOL
        pq <- illuminaHumanv3PROBEQUALITY
    }

    ## grab the HUGO symbol for each probe
    mapped <- mappedkeys(hugo)
    hugo2 <- as.list(hugo[mapped])
    hugoDF <- data.frame(PROBE_ID = as.character(names(hugo2)), SYMBOL = as.character(hugo2))

I did the same for the other entries, then merged them like this:

    expressionAnnotationData <- merge(keggDF, geneNameDF, by = "PROBE_ID", all = TRUE)
    expressionAnnotationData <- merge(expressionAnnotationData, hugoDF, by = "PROBE_ID", all = TRUE)
    expressionAnnotationData <- merge(expressionAnnotationData, pqDF, by = "PROBE_ID", all = TRUE)

However, the merged table has a lot of probe IDs that are not mapped to anything. Initially I thought these were probes without proper matches, but plenty of the unannotated rows have "Perfect" probe quality:

        PROBE_ID keggID keggName geneNameID SYMBOL     QUALITY
31545 ILMN_1670545   <NA>     <NA>       <NA>   <NA>     Perfect
31546 ILMN_1670589   <NA>     <NA>       <NA>   <NA>         Bad
31547 ILMN_1670625   <NA>     <NA>       <NA>   <NA>     Perfect
31548 ILMN_1670641   <NA>     <NA>       <NA>   <NA>    No match
31549 ILMN_1670666   <NA>     <NA>       <NA>   <NA>         Bad
31550 ILMN_1670668   <NA>     <NA>       <NA>   <NA>     Perfect
31551 ILMN_1670674   <NA>     <NA>       <NA>   <NA>         Bad
31552 ILMN_1670715   <NA>     <NA>       <NA>   <NA>     Perfect
31553 ILMN_1670716   <NA>     <NA>       <NA>   <NA>         Bad
31554 ILMN_1670737   <NA>     <NA>       <NA>   <NA>     Perfect
31555 ILMN_1670757   <NA>     <NA>       <NA>   <NA>         Bad
31556 ILMN_1670800   <NA>     <NA>       <NA>   <NA>         Bad
31557 ILMN_1670805   <NA>     <NA>       <NA>   <NA>    Good****

Am I doing something wrong when extracting the annotation? If not, is this normal, and what should I do about it?

Thank you

Slava

probe to gene id illumina beadchip probe quality • 1.6k views

updated 9.0 years ago by James W. MacDonald • written 9.0 years ago by smt8n

score 0 · Answer 1 · 2016-04-22

You will probably get better results if you use the newer select()/mapIds() interface from AnnotationDbi to extract the information. The BiMap interface is pretty old school, and it doesn't handle one-to-many mappings the way one would normally like. But like you say, there are lots of probes that are simply not annotated.
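To make the one-to-many point concrete, here is a minimal base-R sketch. The `toy` list and its probe IDs are made up for illustration; it only stands in for the kind of list that `as.list()` returns on a BiMap when a probe maps to several values. It shows what `as.character()` does to such a list, and one possible way to collapse the values explicitly.

```r
# Toy stand-in for as.list(<BiMap>): probe -> one or more pathway IDs.
# Probe IDs and values are invented for illustration only.
toy <- list(ILMN_A = "03013",
            ILMN_B = c("00010", "01100"))

# as.character() on a list deparses any multi-element entry into one string
naive <- as.character(toy)
print(naive)

# Collapsing explicitly keeps every value in a readable form
collapsed <- vapply(toy, paste, character(1), collapse = ";")
print(collapsed)
```

With mapIds() the equivalent choice is made through its multiVals argument, which defaults to "first" and can be set to, for example, "list" to keep every match.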

Edit: forgot to add where the 'ks' vector came from

> library(illuminaHumanv4.db) ## the question used illuminaHumanv3.db; swap in the package for your array
> ks <- keys(illuminaHumanv4.db)
> toget <- c("PATH","GENENAME","SYMBOL","ENTREZID","ENSEMBL", "ACCNUM")
> df <- as.data.frame(lapply(toget, function(x) mapIds(illuminaHumanv4.db, ks, x, "PROBEID")))
> names(df) <- c("KEGG","GENENAME","SYMBOL","ENTREZID","ENSEMBL", "ACCNUM")

## PROBEQUALITY isn't a column that mapIds() can return, so take it from the BiMap
> z <- toTable(illuminaHumanv4PROBEQUALITY)
> df$PROBEQUALITY <- z[match(row.names(df), z[,1]),2]
> head(df)
             KEGG GENENAME SYMBOL ENTREZID ENSEMBL ACCNUM PROBEQUALITY
ILMN_1343048 <NA>     <NA>   <NA>     <NA>    <NA>   <NA>     No match
ILMN_1343049 <NA>     <NA>   <NA>     <NA>    <NA>   <NA>     No match
ILMN_1343050 <NA>     <NA>   <NA>     <NA>    <NA>   <NA>     No match
ILMN_1343052 <NA>     <NA>   <NA>     <NA>    <NA>   <NA>     No match
ILMN_1343059 <NA>     <NA>   <NA>     <NA>    <NA>   <NA>     No match
ILMN_1343061 <NA>     <NA>   <NA>     <NA>    <NA>   <NA>     No match
> df.justmatches <- df[!df$PROBEQUALITY %in% "No match",]
> head(df.justmatches)
              KEGG                                           GENENAME  SYMBOL
ILMN_1343291 03013 eukaryotic translation elongation factor 1 alpha 1  EEF1A1
ILMN_1343295 00010           glyceraldehyde-3-phosphate dehydrogenase   GAPDH
ILMN_1651199  <NA>                                               <NA>    <NA>
ILMN_1651209  <NA>                solute carrier family 35, member E2 SLC35E2
ILMN_1651210  <NA>                                               <NA>    <NA>
ILMN_1651221  <NA>                   EF-hand calcium binding domain 1  EFCAB1
             ENTREZID         ENSEMBL   ACCNUM PROBEQUALITY
ILMN_1343291     1915 ENSG00000156508 AAA52343      Perfect
ILMN_1343295     2597 ENSG00000111640 AAA52496      Perfect
ILMN_1651199     <NA>            <NA>     <NA>          Bad
ILMN_1651209     9906 ENSG00000215790 AAI01667      Perfect
ILMN_1651210     <NA>            <NA>     <NA>          Bad
ILMN_1651221    79645 ENSG00000034239 AAH25676          Bad
> apply(df, 2, function(x) sum(is.na(x)))
        KEGG     GENENAME       SYMBOL     ENTREZID      ENSEMBL       ACCNUM
       37638        12221        12221        12221        13756        12221
PROBEQUALITY
           0
> apply(df.justmatches, 2, function(x) sum(is.na(x)))
        KEGG     GENENAME       SYMBOL     ENTREZID      ENSEMBL       ACCNUM
       36391        10987        10987        10987        12517        10987
PROBEQUALITY
           0
> apply(df.justmatches[!df.justmatches$PROBEQUALITY %in% "Bad",], 2, function(x) sum(is.na(x)))
        KEGG     GENENAME       SYMBOL     ENTREZID      ENSEMBL       ACCNUM
       25574         4242         4242         4242         5316         4242
PROBEQUALITY
           0

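So most of the 'not bad' probes have annotations: the number of probes without a SYMBOL drops from 12221 overall to 4242 once the 'No match' and 'Bad' probes are removed. But it does look like there is a lot of speculative content on this array.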
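If you want the same breakdown in one step rather than by repeated subsetting, something along these lines should work. This is only a sketch on a toy data frame (`toy_df`, with invented rows) shaped like the `df` above; on real data you would substitute `df` itself.

```r
# Toy data frame shaped like `df` above; all values are invented.
toy_df <- data.frame(
  SYMBOL = c("EEF1A1", NA, NA, "GAPDH", NA, "EFCAB1"),
  PROBEQUALITY = c("Perfect", "No match", "Bad", "Perfect", "Perfect", "Bad"),
  row.names = paste0("probe", 1:6),
  stringsAsFactors = FALSE
)

# Rows: probe quality class; columns: whether SYMBOL is annotated
tab <- table(quality = toy_df$PROBEQUALITY,
             annotated = !is.na(toy_df$SYMBOL))
print(tab)

# Fraction of probes lacking a symbol within each quality class
frac_na <- tapply(is.na(toy_df$SYMBOL), toy_df$PROBEQUALITY, mean)
print(frac_na)
```

The cross-tabulation makes it easy to see which quality classes account for the missing annotations before deciding which probes to filter out.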