Hi,
I'm currently working on HTA 2.0 ST array and would like to do annotation through hta20sttranscriptcluster.db
But I realized the probe ID hta20sttranscriptcluster.db is not the same as the ID in the expression dataset:
> head(featureNames(dataset$eset))
[1] "2824546_st" "2824549_st" "2824551_st" "2824554_st" "2827992_st"
[6] "2827995_st"
> head(keys(hta20sttranscriptcluster.db, keytype = "PROBEID"))
[1] "18670000" "18670001" "18670002" "18670003" "18670004" "18670005"
I was wondering if there's a way to convert between the two IDs in hta20sttranscriptcluster.db and expression dataset?
Thank you!
Sylvia
Hi James,
I just have a follow up question, I was wondering what build of the genome was the annotation based on. Is it hg18 or hg19?
Best,
Sylvia
Annotation data are, for the most part, independent of the genome build. In other words, most data that would normally be considered annotation (various gene, protein or transcript IDs, GO IDs, gene names and symbols, pathway annotations, etc) have more to do with either the function or sequence of the gene or its transcripts and resulting proteins than its genomic location.
For example, if the assigned genomic location for BRCA1 were to move 1 Kb upstream between the hg18 and hg19 builds, the only thing different about that gene would be the chromosomal location data, which are in general better to get from a TxDb object than a ChipDb object anyway. It would still have the same symbol, gene name, Gene ID, Ensembl ID, etc.
In addition, annotation data are not released in build format. Most of the annotation databases at NCBI (and presumably at Ensembl as well) are updated on a weekly basis, so they are in a state of constant flux. We 'freeze' the annotation at six month intervals when we release a new set of annotation packages, but this is a tradeoff we make between being completely updated and allowing people to have reproducible results.
I built the annotation packages in the first week of October 2015, so all the data are based on what was available from the various annotation sites as of that week.
I see, thanks for the quick reply! I was a bit concerned cause some of the same genes in the hta20sttrnascriptcluster.db have different chromosome locations and when I checked NCBI I realized none of the location match to the NCBI chromosome location for that gene. For example, Entrez gene ID: 100873740 have CHRLOCCHR 1 and 14 in hta20sttranscriptcluster.db, however, when I checked NCBI it says RNU6-31P is on chromosome 2. That's why I would like to double check what does the annotation base on.
Best,
Sylvia
Opinions differ on the location of that gene. Note that the hta20sttranscriptcluster.db package really only has a single table, mapping probeset ID to Entrez Gene ID. The rest of the mappings are handled by the org.Hs.eg.db package. So this discrepancy isn't due to anything in the hta20sttranscriptcluster.db package.
Anyway, to show what I am talking about, let's do some SQL queries on the tables in the org.Hs.eg.db package.
So this explains the discrepancy we get from this select() call, that I didn't really understand:
In other words, the 'chromosomes' table says the gene is on chr2, whereas the 'chromosome_locations' table says otherwise. But like I said before, this isn't really the package for doing this sort of thing. We should really be using a TxDb package.
So UCSC is pretty sure that this pseudogene is found on chr1 and chr14, not chr2. There is probably more to this story, but I can't really delve further into it because A.) I don't have the time, really, and B.) the server we used to do the annotation is a virtual server in the cloud, that is not running right now, and I don't have the authoritah to fire it up to check over the code to see exactly where the two (chromosomes and chromosome_location) tables come from. I am pretty sure they are from two different tables we download from NCBI, but I don't know which ones.
I should also note that Bioconductor doesn't make any claim to the accuracy of the data we provide. We are simply re-packaging existing data that we get from the people whose job it is to collect and distribute the data. As with any such large projects that are administered by different teams in different countries, there will inevitably be differences that have to be ironed out over time.
Hi James,
Thank you for your time! I'll take a look at TxDb.
Best,
Sylvia