Question

Re: Why am I finding a mismatch between refseq_dna and ensembl_transcript_id?

0


mauede@alice.it ▴ 870

@mauedealiceit-3511

Last seen 10.2 years ago

Actually, I extracted the same information the old way, that is, using a loop that queried one refseq_dna at a time. I know this is not how a high-level language like R is meant to be used. However, I could see that some ENSTs correspond to two different HGNC symbols. Moreover, the 3utr sequence is not available for all the ENSTs I have.

Thank you for your answer.
Regards,
Maura

-----Original Message-----
From: Sean Davis [mailto:seandavi@gmail.com]
Sent: Wed 29/07/2009 7.46
To: mauede@alice.it
Cc: Bioconductor List
Subject: Re: [BioC] Why am I finding a mismatch between refseq_dna and ensembl_transcript_id ?

On Wed, Jul 29, 2009 at 12:01 AM, <mauede@alice.it> wrote:

> I downloaded the following file from miRDB:
> http://mirdb.org/miRDB/download/MirTarget2_v3.0_prediction_result.txt.gz
>
> I have checked that the miRDB Gene_Bank_Accession_Number (for human it is
> something like NM_xxxxx) corresponds to the BioMart "refseq_dna" attribute.
>
> I have a vector containing 253 Gene_Bank_Accession_Numbers:
>
> > length(tmp_miRNA_GB)
> [1] 253
> > tmp_miRNA_GB[1:5]
> [1] "NM_203390" "NM_024639" "NM_001017989" "NM_203331" "NM_001879"
>
> I use this vector as the input filter to getBM to obtain the respective
> ensembl_transcript_id. Surprisingly, only 246 ensembl_transcript_ids are found:
>
> > gene.map <- getBM(attributes = c("hgnc_symbol", "ensembl_gene_id", "refseq_dna", "ensembl_transcript_id"),
> +                   filters = "refseq_dna", values = tmp_miRNA_GB, mart = hmart)
> > dim(gene.map)
> [1] 246 4
>
> I thought there would be a 1-1 correspondence between the two attributes
> "refseq_dna" and "ensembl_transcript_id".
> Am I mistaken?

Hi, Maura.

Yes, unfortunately, there is not a 1-1 correspondence. Ensembl and NCBI (the curator of RefSeq) are independent organizations, each with different build policies and annotation processes for transcripts. So, in general in this field (genomics/bioinformatics), there is RARELY a 1-1 correspondence between any two entities. I would suggest that 246/253 is actually quite a good result; I might have expected a bit less a priori.

Sean
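
To see where the gap between 253 queried accessions and 246 returned rows comes from, it can help to compare the input vector with the result table directly. The following is only a minimal sketch in base R on toy data: the three accessions are taken from the post, but the Ensembl IDs (`ENST_A` and so on) and the names `query_ids` and `gene_map` are made-up placeholders standing in for `tmp_miRNA_GB` and `gene.map`.

```r
# Toy stand-ins: the accessions come from the post; the Ensembl IDs
# are placeholders, not real annotations.
query_ids <- c("NM_203390", "NM_024639", "NM_001017989")

# Hypothetical shape of a getBM() result: one row per (RefSeq, ENST) pair.
gene_map <- data.frame(
  refseq_dna = c("NM_203390", "NM_203390", "NM_024639"),
  ensembl_transcript_id = c("ENST_A", "ENST_B", "ENST_C"),
  stringsAsFactors = FALSE
)

# Accessions that were queried but came back with no row at all.
unmapped <- setdiff(query_ids, gene_map$refseq_dna)

# Accessions that map to more than one Ensembl transcript.
hits_per_refseq <- table(gene_map$refseq_dna)
multi_mapped <- names(hits_per_refseq)[hits_per_refseq > 1]

# Distinct accessions recovered versus rows returned.
n_found <- length(unique(gene_map$refseq_dna))
n_rows <- nrow(gene_map)
```

Because one accession can produce several rows, the row count alone does not say how many accessions were matched; comparing `n_found` with `length(query_ids)` is the more direct check.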

Annotation biomaRt • 1.1k views

updated 15.3 years ago by Sean Davis 21k • written 15.3 years ago by mauede@alice.it ▴ 870

score 0 · Answer 1 · 2009-07-28

On Wed, Jul 29, 2009 at 3:35 AM, <mauede@alice.it> wrote:

> Actually, I extracted the same information the old way, that is, using a
> loop that queried one refseq_dna at a time.
> I know this is not how a high-level language like R is meant to be used.
> However, I could see that some ENSTs correspond to two different HGNC symbols.
> Moreover, the 3utr sequence is not available for all the ENSTs I have.

Not all transcripts have a 3' UTR. If you want to check your code, you can always go to the Ensembl browser to see what it shows for those transcripts for which the 3' UTR is missing.

Sean
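
One way to build the list of transcripts to look up in the Ensembl browser is to flag the rows whose 3' UTR came back empty. This is a rough sketch on toy data, not output from a real query: the table `utr`, its IDs, and its sequence are placeholders, and it assumes that the sequence column is named `3utr` and that a missing UTR shows up either as `NA` or as the literal string "Sequence unavailable".

```r
# Hypothetical result table; the IDs and the sequence are placeholders.
utr <- data.frame(
  ensembl_transcript_id = c("ENST_A", "ENST_B", "ENST_C"),
  `3utr` = c("ACGTTTGA", "Sequence unavailable", NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Assumption: a missing 3' UTR appears either as NA or as the literal
# string "Sequence unavailable"; adjust to whatever your query returns.
no_utr <- is.na(utr[["3utr"]]) | utr[["3utr"]] == "Sequence unavailable"

# Transcripts to inspect by hand in the Ensembl browser.
to_inspect <- utr$ensembl_transcript_id[no_utr]
```

If the transcripts in `to_inspect` also show no 3' UTR in the browser, the gap is in the annotation rather than in the code.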