Perry Moerland
Hi Mark,
Thanks for your detailed reply! Are you planning to persuade them to
rerun the Perl script with the latest version of UniGene for the next
release of the illuminaHumanv4.db package (and the other Illumina
re-annotation packages)?
best,
Perry
---
Perry Moerland, PhD
Room J1B-215
Bioinformatics Laboratory, Department of Clinical Epidemiology,
Biostatistics and Bioinformatics
Academic Medical Center, University of Amsterdam
Postbus 22660, 1100 DD Amsterdam, The Netherlands
tel: +31 20 5666945
p.d.moerland@amc.uva.nl, http://www.bioinformaticslaboratory.nl/
From: Mark Dunning [mailto:mark.dunning@gmail.com]
Sent: Thursday, November 21, 2013 12:35 PM
To: P.D. Moerland
Cc: bioconductor@r-project.org
Subject: Re: inconsistency in illuminaHumanv4.db?
Hi Perry,
Sorry for the delay in responding. I should explain that the
annotation packages that we provide are built upon the results of an
in-house Perl script (described in Barbosa-Morais et al) where we map
probes to the genome and transcriptome separately and collate the
results. As you point out, the resources used are not well-documented
so it took a while to get the relevant information from the people
that actually run the script. We hope to improve on this for future
releases. As for your query, we were essentially using an old version
of UniGene for cross-referencing.
When the Perl script was last run, in September 2011, we used
UniGene v230 which had an entry Hs.466662 with the gene symbol
C1orf151 and Entrez gene ID 440574. The way these get associated with
Illumina probes is through sequence cross-references in the UniGene
entry. So, for example, the top BLAST hit in RefSeq for the first
probe, ILMN_2064311, was the transcript NM_001204083 which is one of
the cross-references in the UniGene record. The second of the four
probes matches the same transcript while the other two match another
RefSeq transcript, NM_001204089, that is also among the
cross-reference sequences in the same UniGene record.
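To make the lookup concrete, here is a minimal R sketch of the association step described above. It is only an illustration under stated assumptions: the toy data frame layout and the helper `annotate_probe` are hypothetical and are not taken from the Perl script, and the assignment of the second to fourth probes to transcripts assumes they are counted in the order Perry listed them.

```r
# Toy stand-in for one UniGene record, using values quoted in this thread.
# The table layout is hypothetical; it is not the format the Perl script uses.
unigene_xref <- data.frame(
  cluster = "Hs.466662",
  refseq  = c("NM_001204083", "NM_001204089"),
  entrez  = "440574",
  symbol  = "C1orf151",
  stringsAsFactors = FALSE
)

# Top RefSeq BLAST hit per probe. Only the first probe is named explicitly
# in the thread; the order of the other three is an assumption.
top_hit <- c(
  ILMN_2064311 = "NM_001204083",
  ILMN_1657860 = "NM_001204083",
  ILMN_1789599 = "NM_001204089",
  ILMN_2405009 = "NM_001204089"
)

# Hypothetical helper: find the UniGene row that lists each probe's top
# RefSeq hit and return that record's Entrez Gene ID and symbol.
annotate_probe <- function(hit, xref) {
  row <- xref[match(hit, xref$refseq), c("entrez", "symbol")]
  rownames(row) <- names(hit)
  row
}

reannotated <- annotate_probe(top_hit, unigene_xref)
print(reannotated)
```

Because both transcripts are cross-referenced by the same UniGene record, all four probes inherit that record's single Entrez Gene ID and symbol in this sketch.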
In the current version of UniGene (v236) the gene symbol for that same
record is now MINOS1. It still contains the same RefSeq transcript
links, so assuming those still come up as the top BLAST hits for these
probes, all four would again end up with the same Entrez Gene ID.
The Ensembl gene IDs come directly from the BLAST search against the
Ensembl transcripts.
Mark
On Fri, Nov 15, 2013 at 8:43 PM, P.D. Moerland
<p.d.moerland@amc.uva.nl> wrote:
Dear all, dear Mark,
I'm a grateful user of the illuminaHumanv4.db annotation package. One
of my collaborators is interested in probes mapping to C1orf151
according to the reannotation provided by the package. However, the
re-annotation for these probes seems inconsistent:
> Illids = get("C1orf151",revmap(illuminaHumanv4SYMBOLREANNOTATED))
> Illids
[1] "ILMN_2064311" "ILMN_1657860" "ILMN_1789599" "ILMN_2405009"
> indx = match(Illids,illuminaHumanv4fullReannotation()[,1])
> tab = illuminaHumanv4fullReannotation()[indx,]
> tab[,c(1,4,11:13,16)]
      IlluminaID   ProbeQuality EntrezReannotated GenomicLocation          SymbolReannotated EnsemblReannotated
4615  ILMN_2064311 Bad          440574            chr1:19954844:19954893:+ C1orf151          ENSG00000173436
24195 ILMN_1657860 Perfect      440574            chr1:19954399:19954448:+ C1orf151          ENSG00000173436
39363 ILMN_1789599 Perfect      440574            chr1:19984747:19984796:+ C1orf151          ENSG00000158747
46631 ILMN_2405009 Perfect      440574            chr1:19984595:19984644:+ C1orf151          ENSG00000158747
As you can see two probes map to ENSG00000173436 and the other two
probes to ENSG00000158747. This is in agreement with their annotation
on the Ensembl website. The reannotated Entrez Gene ID and the
reannotated symbol, however, seem inconsistent with this. According to
the Ensembl website and according to org.Hs.eg.db the annotation of
the two ENSG IDs is:
> IDs = unlist(mget(tab$EnsemblReannotated,org.Hs.egENSEMBL2EG))
> IDs
 ENSG00000173436  ENSG00000173436 ENSG000001587471 ENSG000001587472 ENSG000001587471 ENSG000001587472
        "440574"         "440574"           "4681"      "100532736"           "4681"      "100532736"
> unlist(mget(IDs,org.Hs.egSYMBOL))
       440574        440574          4681     100532736          4681     100532736
     "MINOS1"      "MINOS1"        "NBL1" "MINOS1-NBL1"        "NBL1" "MINOS1-NBL1"
Note that C1orf151 is an alias for MINOS1 and that MINOS1 and NBL1 are
neighboring genes on chromosome 1; MINOS1-NBL1 is the readthrough
transcript.
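The comparison above can be sketched without the annotation packages. This is a minimal base R illustration, not part of the package: the values are copied from the output in this message, and the `ens2eg` list and the `consistent` column are names made up for the example.

```r
# Reannotation values copied from the table in this message.
tab <- data.frame(
  IlluminaID = c("ILMN_2064311", "ILMN_1657860",
                 "ILMN_1789599", "ILMN_2405009"),
  EntrezReannotated = "440574",
  EnsemblReannotated = c("ENSG00000173436", "ENSG00000173436",
                         "ENSG00000158747", "ENSG00000158747"),
  stringsAsFactors = FALSE
)

# Ensembl -> Entrez mapping as reported by org.Hs.eg.db in this message,
# hard-coded (assumption) so the sketch runs without the package.
ens2eg <- list(
  ENSG00000173436 = "440574",
  ENSG00000158747 = c("4681", "100532736")
)

# Flag a probe as consistent when its reannotated Entrez Gene ID is among
# the Entrez Gene IDs linked to its reannotated Ensembl gene.
tab$consistent <- mapply(
  function(eg, ens) eg %in% ens2eg[[ens]],
  tab$EntrezReannotated, tab$EnsemblReannotated,
  USE.NAMES = FALSE
)
print(tab[, c("IlluminaID", "EnsemblReannotated", "consistent")])
```

With these values, the two probes on ENSG00000173436 pass the check and the two on ENSG00000158747 do not, which is the discrepancy in question.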
Why does illuminaHumanv4.db link all 4 probes to a single Entrez
Gene ID (440574) and a single symbol (C1orf151)? The more general
question is probably how identifier conversion is performed for the
re-annotation. I tried to find a description in the package
documentation and in Barbosa-Morais et al. (2010), but without success.
best wishes,
Perry
---
Perry Moerland, PhD
Room J1B-215
Bioinformatics Laboratory, Department of Clinical Epidemiology,
Biostatistics and Bioinformatics
Academic Medical Center, University of Amsterdam
Postbus 22660, 1100 DD Amsterdam, The Netherlands
tel: +31 20 5666945
p.d.moerland@amc.uva.nl, http://www.bioinformaticslaboratory.nl/
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] illuminaHumanv4.db_1.20.0 org.Hs.eg.db_2.10.1 RSQLite_0.11.4 DBI_0.2-7
[5] AnnotationDbi_1.24.0 Biobase_2.22.0 BiocGenerics_0.8.0
loaded via a namespace (and not attached):
[1] AnnotationForge_1.4.0 IRanges_1.20.4 stats4_3.0.2
________________________________
AMC Disclaimer : http://www.amc.nl/disclaimer
________________________________