How to find stop codons for a list of genes
1
0
Entering edit mode
@mgierlinski-7369
Last seen 4 months ago
United Kingdom

I am looking for an enrichment of a certain motif around stop codon for a selection of genes. First, I need to find position of a stop codon(s) for each gene. I was naively expecting that the stop codon would be at the end of the last exon in a gene. It doesn't look like this. Here is an example. First, I extract some attributes from biomaRt for a couple of genes (my list contains ~200 genes, these two are only an example). For simplicity, these genes are on the forward strand.

library(biomaRt)
library(BSgenome.Hsapiens.NCBI.GRCh38)
library(tidyverse)

genes <- c("ENSG00000065989", "ENSG00000003989")
mart <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = "110")
attribs <- c("ensembl_transcript_id", "ensembl_exon_id", "chromosome_name", "exon_chrom_start", "exon_chrom_end")
bm <- getBM(mart = mart, attributes = attribs, filters = "ensembl_gene_id", values = genes)

I have genomic co-ordinates of each exon:

head(bm)
  ensembl_transcript_id ensembl_exon_id chromosome_name exon_chrom_start exon_chrom_end
1       ENST00000494857 ENSE00001826862               8         17497088       17497237
2       ENST00000494857 ENSE00001841976               8         17502257       17502302
3       ENST00000494857 ENSE00003483589               8         17543318       17543715
4       ENST00000494857 ENSE00000457144               8         17544451       17544606
5       ENST00000494857 ENSE00000457145               8         17548678       17548843
6       ENST00000494857 ENSE00000924832               8         17550301       17550434

Next, I find the end of the last exon for each transcript and create small regions 20 bases upstream:

regions <- bm |> 
  as_tibble() |> 
  group_by(ensembl_transcript_id, chromosome_name) |> 
  summarise(end = max(exon_chrom_end)) |> 
  mutate(start = end - 20, .before = end) |>
  arrange(end)

Now, I extract sequence for each region:

seqs <- regions |> 
  rowwise() |> 
  mutate(seq = getSeq(Hsapiens, as.character(chromosome_name), start, end) |> as.character())

Here is the result:

# A tibble: 15 × 5
# Rowwise:  ensembl_transcript_id
   ensembl_transcript_id chromosome_name    start      end seq                  
   <chr>                           <int>    <dbl>    <int> <chr>                
 1 ENST00000586275                    19 10459710 10459730 CTGACGTGCTGCAGTCCACCC
 2 ENST00000591971                    19 10461909 10461929 CCTCCTGGCTGACCTGAAGAC
 3 ENST00000589073                    19 10467345 10467365 GTTCTCGTCCCGGGAGGAATT
 4 ENST00000590407                    19 10467346 10467366 TTCTCGTCCCGGGAGGAATTC
 5 ENST00000293683                    19 10467601 10467621 TCAGGTGGAGACCCTACCTGA
 6 ENST00000440014                    19 10467601 10467621 TCAGGTGGAGACCCTACCTGA
 7 ENST00000592685                    19 10467629 10467649 ACCTCTGTCCCTGTTCCCCTC
 8 ENST00000344979                    19 10467911 10467931 AAAAAAAAAGAAAGAAACACA
 9 ENST00000380702                    19 10469610 10469630 TTAAACCGAGGCTTTCACCGA
10 ENST00000640220                     8 17565142 17565162 TCTAACACTTGCAGGAGCAGA
11 ENST00000522656                     8 17565254 17565274 CATAGTTCACCCTAATTTATA
12 ENST00000004531                     8 17570541 17570561 TTAAAATAAATTTTAATATGT
13 ENST00000398090                     8 17570541 17570561 TTAAAATAAATTTTAATATGT
14 ENST00000494857                     8 17570546 17570566 ATAAATTTTAATATGTAATAA
15 ENST00000470360                     8 17570553 17570573 TTAATATGTAATAATATAGAA

Only a few of them end with TAA or TGA. Also, the last exon in the gene (results are ordered by end) does not end with the stop codon.

Obviously, I don't understand what I am doing. Where is the stop codon? How to find genomic co-ordinates of stop codons for a list of genes?

stopcodon biomaRt • 1.4k views
ADD COMMENT
1
Entering edit mode
ATpoint ★ 4.5k
@atpoint-13662
Last seen 14 hours ago
Germany

Start and stop codons are annotated in GTF files from Ensembl (https://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/) so you can simply read this with rtracklayer::import and then subset the GTF for type=="stop_codon" to get coordinates followed by the getSeq part.

The stop codon is the last three bases of the coding sequence, not of the exon per transcript, as there are untranslated exons (5'- and 3' UTRs) that extend beyond the coding sequence.

enter image description here

ADD COMMENT
0
Entering edit mode

Thank you very much! I use Ensembl's GTF files for mapping, so I have it already on my disk. Didn't realise it contained the exact information I need.

Also, didn't realise there are untranslated exons. This is new for me.

ADD REPLY

Login before adding your answer.

Traffic: 539 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6