Question

several gene names for a probeID in affymetrix annotation file

nazaninhoseinkhan • 0

Dear all,

I am trying to map gene IDs from the S. aureus annotation file to probe IDs.

The problem is that in over 2000 rows there are more than two gene IDs for the corresponding probe ID.

Here is row 3544 of the annotation file as an example:

sa_i10207dr_x_at

1120534 // gi|1120534|ref|NC_002758.2|NC_002758.2(GI:57634611):629461-632324(+) Staphylococcus aureus subsp. aureus Mu50, GENE=sdrC LOCUS=SAV0561 // ncbi_bacterial // 13 // --- /// 1120535 // gi|1120535|ref|NC_002758.2|NC_002758.2(GI:57634611):632689-636848(+) Staphylococcus aureus subsp. aureus Mu50, GENE=sdrD LOCUS=SAV0562 // ncbi_bacterial // 116 // --- /// 1120536 // gi|1120536|ref|NC_002758.2|NC_002758.2(GI:57634611):637240-640667(+) Staphylococcus aureus subsp. aureus Mu50, GENE=sdrE LOCUS=SAV0563 // ncbi_bacterial // 27 // --- /// 1122655 // gi|1122655|ref|NC_002758.2|NC_002758.2(GI:57634611):2782009-2784642(-) Staphylococcus aureus subsp. aureus Mu50, GENE=clfB PRODUCT=Clumping factor B LOCUS=SAV2630 // ncbi_bacterial // 14 // --- /// 1123324 // gi|1123324|ref|NC_002745.2|NC_002745.2(GI:29165615):605214-608077(+) Staphylococcus aureus subsp. aureus N315, GENE=sdrC LOCUS=SA0519 // ncbi_bacterial // 13 // --- /// 1123325 // gi|1123325|ref|NC_002745.2|NC_002745.2(GI:29165615):608442-612601(+) Staphylococcus aureus subsp. aureus N315, GENE=sdrD LOCUS=SA0520 // ncbi_bacterial // 116 // --- /// 1123326 // gi|1123326|ref|NC_002745.2|NC_002745.2(GI:29165615):612993-616420(+) Staphylococcus aureus subsp. aureus N315, GENE=sdrE LOCUS=SA0521 // ncbi_bacterial // 28 // --- /// 1125352 // gi|1125352|ref|NC_002745.2|NC_002745.2(GI:29165615):2718295-2720928(-) Staphylococcus aureus subsp. aureus N315, GENE=clfB PRODUCT=Clumping factor B LOCUS=SA2423 // ncbi_bacterial // 14 // --- /// 3236072 // gi|3236072|ref|NC_002951.2|NC_002951.2(GI:57650036):635788-639935(+) Staphylococcus aureus subsp. aureus COL, GENE=sdrD PRODUCT=sdrD protein LOCUS=SACOL0609 // ncbi_bacterial // 164 // --- /// 3236073 // gi|3236073|ref|NC_002951.2|NC_002951.2(GI:57650036):640327-643829(+) Staphylococcus aureus subsp. aureus COL, GENE=sdrE PRODUCT=sdrE protein LOCUS=SACOL0610 // ncbi_bacterial // 34 // --- /// 3236353 // gi|3236353|ref|NC_002951.2|NC_002951.2(GI:57650036):632578-635423(+) Staphylococcus aureus subsp. aureus COL, GENE=sdrC PRODUCT=sdrC protein LOCUS=SACOL0608 // ncbi_bacterial // 13 // --- /// 3237041 // gi|3237041|ref|NC_002951.2|NC_002951.2(GI:57650036):2711036-2713777(-) Staphylococcus aureus subsp. aureus COL, GENE=clfB PRODUCT=clumping factor B LOCUS=SACOL2652 // ncbi_bacterial // 15 // ---

Is it correct to consider only the first gene ID in each row?

I would appreciate any advice.

Nazanin

affy annotation • 899 views

updated 9.7 years ago by James W. MacDonald • written 9.7 years ago by nazaninhoseinkhan

score 0 · Answer 1 · 2015-03-16

There is a difference between being correct and doing something reasonable. There can be multiple reasons why a particular probeset is thought to interrogate more than one gene. For technical reasons it might not have been possible to design a probeset that will reliably distinguish between two genes. Or there might be duplication in the annotation databases (e.g., there is just one gene, but two groups discovered it and gave it two different names, and the annotation folks haven't figured that out yet and fixed it). Or maybe one is a pseudogene. Or something else. If you search on NC_002758 at NCBI, you come up with almost 3000 hits, and a quick peek indicates that there are tons of discontinued genes, so who knows what the truth is.
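One way to get a feel for why a probeset maps to several IDs is to split the annotation field and group the entries by gene symbol. The Python sketch below does that for three entries copied from the row quoted in the question. It assumes that `///` separates entries and `//` separates fields within an entry, which is what the quoted row looks like; the helper names (`parse_entries`, `group_by_gene`) and the strain regex are made up for this illustration and are not part of any Affymetrix or Bioconductor tool.

```python
import re
from collections import defaultdict

# Three entries copied from the row quoted in the question (the full row has twelve).
ENTRIES = [
    "1120534 // gi|1120534|ref|NC_002758.2|NC_002758.2(GI:57634611):629461-632324(+) "
    "Staphylococcus aureus subsp. aureus Mu50, GENE=sdrC LOCUS=SAV0561 // ncbi_bacterial // 13 // ---",
    "1120535 // gi|1120535|ref|NC_002758.2|NC_002758.2(GI:57634611):632689-636848(+) "
    "Staphylococcus aureus subsp. aureus Mu50, GENE=sdrD LOCUS=SAV0562 // ncbi_bacterial // 116 // ---",
    "1123324 // gi|1123324|ref|NC_002745.2|NC_002745.2(GI:29165615):605214-608077(+) "
    "Staphylococcus aureus subsp. aureus N315, GENE=sdrC LOCUS=SA0519 // ncbi_bacterial // 13 // ---",
]
FIELD = " /// ".join(ENTRIES)


def _first_group(pattern, text):
    """Return the first capture group of pattern in text, or None if absent."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def parse_entries(field):
    """Split one annotation field into a list of dicts, one per gene entry."""
    parsed = []
    # Assumption: ' /// ' separates entries, ' // ' separates fields in an entry.
    for entry in re.split(r"\s*///\s*", field.strip()):
        parts = [p.strip() for p in entry.split(" // ")]
        description = parts[1] if len(parts) > 1 else ""
        parsed.append({
            "gene_id": parts[0],
            "gene": _first_group(r"GENE=(\S+)", description),
            "locus": _first_group(r"LOCUS=(\S+)", description),
            # Assumption: the strain name follows 'subsp. aureus' as in this row.
            "strain": _first_group(r"subsp\. aureus ([^,]+),", description),
        })
    return parsed


def group_by_gene(entries):
    """Map each gene symbol to the (strain, locus) pairs that carry it."""
    groups = defaultdict(list)
    for e in entries:
        groups[e["gene"]].append((e["strain"], e["locus"]))
    return dict(groups)


for gene, hits in group_by_gene(parse_entries(FIELD)).items():
    print(gene, hits)
```

On these three entries, sdrC shows up once for Mu50 and once for N315, so at least part of the multiplicity in this row is the same gene symbol annotated in more than one strain's genome rather than unrelated genes.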

But the point here is that you cannot really go and figure out what the issue is with each of these multiple-mapping probesets. And there are any number of arguments concerning what should be done with them. In the past, the annotation tools would just return NA for this sort of thing. And the more current annotation tools will return all the mapping results with a warning.

You could argue that choosing just one gene is a good idea (or you could concatenate all of them, so there is no loss of information). If you decide that you just want one gene, then you have to decide which one. You could choose the first one, on the assumption that it is somehow the 'best' one. Or you could randomly choose one of the genes, on the assumption that they are not ordered from best to worst.
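The options mentioned so far (return NA, take the first gene, concatenate everything, or pick one at random) are easy to compare side by side. Here is a minimal Python sketch, assuming the gene symbols for a probeset have already been pulled out into a list in the order they appear in the annotation file. The list below is the order of the GENE= values in the row from the question; the function names and the `;` separator are hypothetical choices for this illustration.

```python
import random

# GENE= values from the quoted row for sa_i10207dr_x_at, in file order.
GENES = ["sdrC", "sdrD", "sdrE", "clfB",
         "sdrC", "sdrD", "sdrE", "clfB",
         "sdrD", "sdrE", "sdrC", "clfB"]


def na_if_ambiguous(genes):
    """Old-style behavior: give up (None) unless exactly one distinct gene maps."""
    distinct = set(genes)
    return genes[0] if len(distinct) == 1 else None


def pick_first(genes):
    """Take the first listed gene, assuming the file orders them best to worst."""
    return genes[0] if genes else None


def concatenate(genes, sep=";"):
    """Keep every distinct gene, in first-seen order, so no information is lost."""
    return sep.join(dict.fromkeys(genes))


def pick_random(genes, rng):
    """Choose one gene at random, assuming the file order carries no meaning."""
    return rng.choice(genes) if genes else None


print("NA if ambiguous:", na_if_ambiguous(GENES))
print("first:", pick_first(GENES))
print("concatenated:", concatenate(GENES))
# A seeded generator keeps the random choice reproducible between runs.
print("random:", pick_random(GENES, random.Random(0)))
```

Whichever rule you settle on, applying it through one function like this at least makes the choice explicit and the same for every multi-mapping probeset.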

So I guess that is my long-winded way of saying that I don't know, and nobody else does either. You just have to decide for yourself what you think is a reasonable thing to do, and then do it. I tend to take the first one, based on some small amount of checking that leads me to believe that Affy orders these things from best to worst. But I am most assuredly biased, and it might just be me being lazy.