biomaRt: retrieve exon sequence, start and end positions


Tim Smith ★ 1.1k

@tim-smith-1532

Last seen 10.6 years ago

Hi,

I would like to retrieve the exon sequences (i.e. 5'UTR + CDS + 3'UTR) for a gene, along with the start and end positions for each exon. My short script is:

=========

library(biomaRt)

## 'ensembl' must be a Mart object, e.g. for the human gene dataset:
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")

## Example gene: MTOR; ensembl id "ENSG00000198793"
mySequence <- getSequence(id="ENSG00000198793", type="ensembl_gene_id", seqType="gene_exon", mart=ensembl)

gb <- getBM(attributes=c("ensembl_exon_id", "exon_chrom_start", "exon_chrom_end"), filters="ensembl_gene_id", values="ENSG00000198793", mart=ensembl)

> print(dim(mySequence))
[1] 70 2
> print(dim(gb))
[1] 79 3

======

Should I be doing something else? There seem to be more exons (79) than sequences retrieved (70). Ideally my output would have the following columns:

ENSEMBL_ID EXON_ID EXON_START EXON_END EXON_SEQUENCE

Thanks!

Updated 11.3 years ago by Steffen Durinck ▴ 540 • written 11.3 years ago by Tim Smith ★ 1.1k


Steffen Durinck ▴ 540

@steffen-durinck-4465

Last seen 10.6 years ago

Hi Tim,

Not sure why you get fewer sequences with the getSequence query. You can add gene_exon to your getBM query, though, and get the sequence for all 79 exons with:

gb <- getBM(attributes=c("ensembl_exon_id", "exon_chrom_start", "exon_chrom_end", "gene_exon"), filters="ensembl_gene_id", values="ENSG00000198793", mart=ensembl, bmHeader=TRUE)

Cheers,
Steffen

(In reply to Tim Smith's message of Fri, Jan 10, 2014, 7:46 AM.)
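The five-column layout asked for in the question can be produced from a single getBM result by also requesting ensembl_gene_id and then renaming the columns. The following is a minimal sketch, not part of biomaRt: format_exon_table is a hypothetical helper, the toy data frame holds made-up coordinates and sequences, and the code assumes the default bmHeader=FALSE so that column names match the attribute names (with bmHeader=TRUE the headers would be descriptive labels instead).

```r
# Hypothetical helper (not part of biomaRt): rename and reorder the columns
# of a getBM() result into the layout requested in the question.
format_exon_table <- function(bm) {
  wanted <- c(ENSEMBL_ID    = "ensembl_gene_id",
              EXON_ID       = "ensembl_exon_id",
              EXON_START    = "exon_chrom_start",
              EXON_END      = "exon_chrom_end",
              EXON_SEQUENCE = "gene_exon")
  missing <- setdiff(wanted, colnames(bm))
  if (length(missing) > 0) {
    stop("missing columns: ", paste(missing, collapse = ", "))
  }
  out <- bm[, wanted, drop = FALSE]        # select in the desired order
  colnames(out) <- names(wanted)           # apply the requested headers
  out <- out[order(out$EXON_START), , drop = FALSE]  # sort by start position
  rownames(out) <- NULL
  out
}

# Toy input with made-up values, shaped like a getBM() result.
toy <- data.frame(
  gene_exon        = c("ACGT", "GGCC"),
  ensembl_exon_id  = c("EXON_B", "EXON_A"),
  exon_chrom_start = c(200L, 100L),
  exon_chrom_end   = c(250L, 150L),
  ensembl_gene_id  = "ENSG00000198793",
  stringsAsFactors = FALSE
)
exon_table <- format_exon_table(toy)
print(exon_table)

# Against a live mart (requires network access; not run here):
# library(biomaRt)
# ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# gb <- getBM(attributes = c("ensembl_gene_id", "ensembl_exon_id",
#                            "exon_chrom_start", "exon_chrom_end", "gene_exon"),
#             filters = "ensembl_gene_id", values = "ENSG00000198793",
#             mart = ensembl)
# exon_table <- format_exon_table(gb)
```

Doing the renaming in a separate step keeps the query itself identical to the one suggested above, apart from the extra ensembl_gene_id attribute.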

Written 11.3 years ago by Steffen Durinck ▴ 540