Question

Identifying multiple mappings with pd.clariom.d.human

amir.rakhimov.b • 0

Hi, I'm analysing Affymetrix Clariom D microarray data and have a question about annotation.

While annotating the data with pd.clariom.d.human, I ran into a problem with probes that map to more than one gene. With this package, R keeps only the first of the multiple genes. Even if I call annotateEset() with multivals = "list", it still returns only the first value.

I tried clariomdhumantranscriptcluster.db instead, but that package caused a different problem: some probes were mapped to more genes than they should be. I noticed this when comparing my results with the annotation from Transcriptome Analysis Console (TAC). For example, TAC maps TC0X00007229.hg.1 to GAGE12F, GAGE12J, GAGE12D, GAGE5, GAGE6, GAGE12B, GAGE4, GAGE2E, whereas select() from AnnotationDbi maps the same probe to GAGE12F, GAGE12J, GAGE12D, GAGE5, GAGE6, GAGE12B, GAGE4, GAGE12G, GAGE12I, GAGE7 (GAGE2E is missing).
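The difference between the two annotations can be checked directly in base R. This is only a minimal sketch: both vectors are typed in by hand from the gene lists above, and the variable names are just for illustration.

```r
## Gene symbols reported by TAC for TC0X00007229.hg.1 (copied from above)
tac <- c("GAGE12F", "GAGE12J", "GAGE12D", "GAGE5", "GAGE6",
         "GAGE12B", "GAGE4", "GAGE2E")

## Gene symbols returned by select() on clariomdhumantranscriptcluster.db (copied from above)
db <- c("GAGE12F", "GAGE12J", "GAGE12D", "GAGE5", "GAGE6",
        "GAGE12B", "GAGE4", "GAGE12G", "GAGE12I", "GAGE7")

only_tac <- setdiff(tac, db)   # symbols found only in TAC
only_db  <- setdiff(db, tac)   # symbols found only in the ChipDb package
shared   <- intersect(tac, db) # symbols both annotations agree on

print(only_tac)
print(only_db)
```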

And I cannot use select() from AnnotationDbi, because my data has an AffyHTAPDInfo signature, not ChipDb. How can I annotate the data properly and map genes correctly?

clariomdhumantranscriptcluster.db MicroarrayData Microarray pd.clariom.d.human • 1.1k views

3.0 years ago amir.rakhimov.b

score 0 · Answer 1 · 2022-05-03

I wrote the parser that was used to generate that ChipDb package. The basic idea is that you first need a text file that has the probeset ID in one column and another ID in the second column. I chose to use the GenBank and RefSeq IDs that are provided in the mrna_assignment column. After generating that file, I used makeDBPackage from the AnnotationForge package to make the clariomdhumantranscriptcluster.db package. Part of that process involves mapping the GenBank and RefSeq IDs to NCBI Gene IDs, which is probably where the differences arose. IIRC, the last time we actually generated these packages was maybe 2015 or so, and since then we have simply been updating the version number. This is mainly because (a) Affy have not updated their files since 2013, and (b) almost nobody uses microarrays any longer, so our efforts have been directed towards more modern methods.

Looking back at the code I used, I originally parsed out the NCBI Gene IDs from the gene_assignment column of the csv file, but then switched to using the mrna_assignment column. To use the gene_assignment column you could get the csv file from Thermo Fisher and use this code:

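## rna and dna name the annotation columns in the csv file; the defaults here
## are assumed from the column names mentioned above
parseCsvFiles <- function(csv, fname, rna = "mrna_assignment", dna = "gene_assignment"){
    dat <- read.csv(csv, comment.char = "#", stringsAsFactors = FALSE, na.strings = "---")
    if(!all(c(rna, dna) %in% names(dat)))
        stop("Check the headers for file ", csv, "; they don't include ", rna, " and ", dna, "!")
    ## each entry holds ' /// '-separated records of ' // '-separated fields;
    ## keep the last field of each record (the NCBI Gene ID)
    egs <- lapply(strsplit(dat[,dna], " /// "), function(x) sapply(strsplit(x, " // "), function(y) y[length(y)]))
    ## drop duplicated IDs and '---' placeholders within each probeset
    egs <- lapply(egs, function(x) x[!duplicated(x) & x != "---"])

    ## one row per probeset/Gene ID pair
    egs <- data.frame(probeids = rep(dat[,1], sapply(egs, length)), egids = unlist(egs))

    ## add back missing probesets
    toadd <- data.frame(probeids = dat[!dat[,1] %in% egs[,1],1], egids = rep(NA, sum(!dat[,1] %in% egs[,1])))
    egs <- rbind(egs, toadd)
    write.table(egs, fname, sep = "\t", na = "", row.names = FALSE, col.names = FALSE, quote = FALSE)
}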

library(BiocManager)
install("human.db0")
parseCsvFiles("Clariom_D_Human.na36.hg38.transcript.csv", "text.txt")
library(AnnotationForge)
makeDBPackage("HUMANCHIP_DB", affy = FALSE, prefix = "clariomdhumantranscriptcluster", fileName = "text.txt", baseMapType = "eg", version = "0.0.1", manufacturer = "Affymetrix", chipName = "clariomdhuman")
install.packages("clariomdhumantranscriptcluster.db", repos = NULL) ## if you are on windows, add  type = "source"

I didn't test that code so caveat emptor. You may need to play around with it, but it should be close.
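To make the parsing step in parseCsvFiles a bit more concrete, here is a rough base-R sketch on a single toy gene_assignment value. The accessions, symbols, Gene IDs and the probeset ID are all made up for illustration, and it assumes (as the function above does) that the last ' // ' field of each record is the NCBI Gene ID.

```r
## Toy gene_assignment value; every accession, symbol and Gene ID here is made up
gene_assignment <- paste(
    "NM_000001 // GENEA // gene A // Xp11.23 // 1001",
    "NM_000002 // GENEB // gene B // Xp11.23 // 1002",
    "XM_000003 // GENEA // gene A // Xp11.23 // 1001",
    sep = " /// ")

## Split into one record per transcript, then each record into its fields
records <- strsplit(gene_assignment, " /// ", fixed = TRUE)[[1]]
fields <- strsplit(records, " // ", fixed = TRUE)

## Keep the last field of each record (assumed to be the NCBI Gene ID)
egids <- vapply(fields, function(y) y[length(y)], character(1))

## Two transcripts of the same gene give the same ID, so keep each ID once
egids <- unique(egids)

## One row per probeset/Gene ID pair, as in the file passed to makeDBPackage;
## the probeset ID is a hypothetical placeholder
mapping <- data.frame(probeids = "TC_EXAMPLE.hg.1", egids = egids)
print(mapping)
```

If a probeset ends up with more rows than you expect, comparing the records for that probeset in the csv file against this kind of output is probably the quickest way to see where the extra IDs come from.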