Question

BSgenomes and protein sequences: protein names


Zybaylov, Boris L ▴ 30

@zybaylov-boris-l-5212

Last seen 10.1 years ago

Dear Valerie,

Thank you for your answer! Using your script, I now have the list of translated transcripts that contain the amino acid patterns I am interested in. Here are their names:

    > names(cds_seqs[i])
     [1] "uc001ack.2" "uc001acv.3" "uc001adm.3" "uc001ado.3" "uc001adp.3" "uc001adq.3" "uc001adr.3" "uc001aee.1" "uc001aef.1"
    [10] "uc009vjz.1" "uc010nyj.1" "uc001aer.4" "uc009vkz.1" "uc010nyz.2" "uc001aji.1" "uc009vle.1" "uc001ajj.1" "uc001ajk.1"
    [19] "uc001ajy.2"

The question now is: how do I go from these names to conventional protein names and/or Entrez identifiers?

Thank you very much for your help!

Best regards,
Boris

Dr. Boris Zybaylov
Instructor, Department of Biochemistry and Molecular Biology
University of Arkansas Medical Sciences, Little Rock, AR


Updated 12.4 years ago by Valerie Obenchain ★ 6.8k • written 12.4 years ago by Zybaylov, Boris L ▴ 30

score 0 · Answer 1 · 2012-05-08

Hi Boris,

The select() function in AnnotationDbi can help you map between identifiers. See ?select.

Load the org package for the appropriate organism,

    library(org.Hs.eg.db)

These are the available keytypes in the org package, i.e. the possible forms of your input.

    > keytypes(org.Hs.eg.db)
     [1] "ENTREZID"     "ACCNUM"       "ALIAS"        "CHR"          "ENZYME"
     [6] "MAP"          "OMIM"         "PATH"         "PMID"         "REFSEQ"
    [11] "SYMBOL"       "UNIGENE"      "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
    [16] "UNIPROT"      "UCSCKG"       "GO"

These are the columns that can be returned from a call to select(), i.e. the possible forms of output.

    > cols(org.Hs.eg.db)
     [1] "ENTREZID"     "ACCNUM"       "ALIAS"        "CHR"          "ENZYME"
     [6] "GENENAME"     "MAP"          "OMIM"         "PATH"         "PMID"
    [11] "REFSEQ"       "SYMBOL"       "UNIGENE"      "CHRLOC"       "CHRLOCEND"
    [16] "PFAM"         "PROSITE"      "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
    [21] "UNIPROT"      "UCSCKG"       "GO"

The gene ids in the TxDb.Hsapiens.UCSC.hg19.knownGene package are Entrez ids. This is shown in the 'Type of Gene ID' field,

    > txdb
    TranscriptDb object:
    | Db type: TranscriptDb
    | Supporting package: GenomicFeatures
    | Data source: UCSC
    | Genome: hg19
    | Genus and Species: Homo sapiens
    | UCSC Table: knownGene
    | Resource URL: http://genome.ucsc.edu/
    | Type of Gene ID: Entrez Gene ID
    | Full dataset: yes
    | miRBase build ID: GRCh37
    | transcript_nrow: 80922
    | exon_nrow: 286852
    | cds_nrow: 235842
    | Db created by: GenomicFeatures package from Bioconductor
    | Creation time: 2012-03-12 21:45:23 -0700 (Mon, 12 Mar 2012)
    | GenomicFeatures version at creation time: 1.7.30
    | RSQLite version at creation time: 0.11.1
    | DBSCHEMAVERSION: 1.0

Create a map between your tx names and the Entrez gene ids. A transcript may be associated with no gene id, with one, or with several.
You can use the transcripts accessor to get the transcript-gene relationship,

    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
    tx <- transcripts(txdb, col=c("tx_name", "gene_id"))
    txnames <- values(tx)[["tx_name"]]
    geneid <- values(tx)[["gene_id"]]
    ## one row per transcript-gene pair
    df <- DataFrame(tx_name=rep(txnames, elementLengths(geneid)),
                    gene_id=unlist(geneid, use.names=FALSE))

Get the gene ids for transcripts of interest (the first three transcript names in df are used here as an example),

    entrezid <- df$gene_id[df$tx_name %in% df$tx_name[1:3]]

Map the Entrez gene ids to various protein ids,

    > select(org.Hs.eg.db, keys=entrezid,
             cols=c("PMID", "PFAM", "ENSEMBLPROT", "UNIPROT"),
             keytype="ENTREZID")
       ENTREZID     PMID         IPI  PfamId     ENSEMBLPROT UNIPROT
    1     79501 16710414 IPI00169105 PF00001 ENSP00000334393  Q8NH21
    2     79501 16710414 IPI01010102 PF00001 ENSP00000334393  Q8NH21
    3 100133331 16751776        <NA>    <NA>            <NA>    <NA>
    4 100132287 16751776        <NA>    <NA>            <NA>    <NA>

If you are not working with the TxDb.Hsapiens.UCSC.hg19.knownGene annotation, you may be starting with different identifiers (e.g., the gene ids may not be Entrez).

You may also want to check out the PFAM.db package,

    biocLite("PFAM.db")

Valerie
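
The step in the answer that repeats each transcript name once per associated gene id can be tried on toy data in base R, without loading any annotation package. This is only a sketch under stated assumptions: the gene ids `g1` to `g3` are made-up placeholders rather than real Entrez ids, a plain list stands in for the `gene_id` column returned by `transcripts()`, and base R's `lengths()` stands in for `elementLengths()`. As far as I know, recent Bioconductor releases spell some of the calls in the answer differently (`columns()` instead of `cols()`, `elementNROWS()` instead of `elementLengths()`), so check the current documentation before copying the answer verbatim.

```r
# Toy stand-in for the transcript-to-gene mapping.
# Transcript names come from the question; gene ids are placeholders.
txnames <- c("uc001ack.2", "uc001acv.3", "uc001adm.3")
geneid  <- list("g1", character(0), c("g2", "g3"))  # one, zero, two genes

# One row per (transcript, gene) pair.
# A transcript with no gene id contributes zero rows and drops out.
df <- data.frame(
  tx_name = rep(txnames, lengths(geneid)),
  gene_id = unlist(geneid, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Look up the gene ids for the transcripts of interest.
tx_of_interest <- c("uc001ack.2", "uc001adm.3")
hits <- df[df$tx_name %in% tx_of_interest, ]
print(hits)
```

With the real objects, `names(cds_seqs[i])` from the question would take the place of `tx_of_interest`, and `hits$gene_id` would be the keys passed to `select()`.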