I am looking for an enrichment of a certain motif around stop codon for a selection of genes. First, I need to find position of a stop codon(s) for each gene. I was naively expecting that the stop codon would be at the end of the last exon in a gene. It doesn't look like this. Here is an example. First, I extract some attributes from biomaRt
for a couple of genes (my list contains ~200 genes, these two are only an example). For simplicity, these genes are on the forward strand.
library(biomaRt)
library(BSgenome.Hsapiens.NCBI.GRCh38)
library(tidyverse)
genes <- c("ENSG00000065989", "ENSG00000003989")
mart <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = "110")
attribs <- c("ensembl_transcript_id", "ensembl_exon_id", "chromosome_name", "exon_chrom_start", "exon_chrom_end")
bm <- getBM(mart = mart, attributes = attribs, filters = "ensembl_gene_id", values = genes)
I have genomic co-ordinates of each exon:
head(bm)
ensembl_transcript_id ensembl_exon_id chromosome_name exon_chrom_start exon_chrom_end
1 ENST00000494857 ENSE00001826862 8 17497088 17497237
2 ENST00000494857 ENSE00001841976 8 17502257 17502302
3 ENST00000494857 ENSE00003483589 8 17543318 17543715
4 ENST00000494857 ENSE00000457144 8 17544451 17544606
5 ENST00000494857 ENSE00000457145 8 17548678 17548843
6 ENST00000494857 ENSE00000924832 8 17550301 17550434
Next, I find the end of the last exon for each transcript and create small regions 20 bases upstream:
regions <- bm |>
as_tibble() |>
group_by(ensembl_transcript_id, chromosome_name) |>
summarise(end = max(exon_chrom_end)) |>
mutate(start = end - 20, .before = end) |>
arrange(end)
Now, I extract sequence for each region:
seqs <- regions |>
rowwise() |>
mutate(seq = getSeq(Hsapiens, as.character(chromosome_name), start, end) |> as.character())
Here is the result:
# A tibble: 15 × 5
# Rowwise: ensembl_transcript_id
ensembl_transcript_id chromosome_name start end seq
<chr> <int> <dbl> <int> <chr>
1 ENST00000586275 19 10459710 10459730 CTGACGTGCTGCAGTCCACCC
2 ENST00000591971 19 10461909 10461929 CCTCCTGGCTGACCTGAAGAC
3 ENST00000589073 19 10467345 10467365 GTTCTCGTCCCGGGAGGAATT
4 ENST00000590407 19 10467346 10467366 TTCTCGTCCCGGGAGGAATTC
5 ENST00000293683 19 10467601 10467621 TCAGGTGGAGACCCTACCTGA
6 ENST00000440014 19 10467601 10467621 TCAGGTGGAGACCCTACCTGA
7 ENST00000592685 19 10467629 10467649 ACCTCTGTCCCTGTTCCCCTC
8 ENST00000344979 19 10467911 10467931 AAAAAAAAAGAAAGAAACACA
9 ENST00000380702 19 10469610 10469630 TTAAACCGAGGCTTTCACCGA
10 ENST00000640220 8 17565142 17565162 TCTAACACTTGCAGGAGCAGA
11 ENST00000522656 8 17565254 17565274 CATAGTTCACCCTAATTTATA
12 ENST00000004531 8 17570541 17570561 TTAAAATAAATTTTAATATGT
13 ENST00000398090 8 17570541 17570561 TTAAAATAAATTTTAATATGT
14 ENST00000494857 8 17570546 17570566 ATAAATTTTAATATGTAATAA
15 ENST00000470360 8 17570553 17570573 TTAATATGTAATAATATAGAA
Only a few of them end with TAA
or TGA
. Also, the last exon in the gene (results are ordered by end
) does not end with the stop codon.
Obviously, I don't understand what I am doing. Where is the stop codon? How to find genomic co-ordinates of stop codons for a list of genes?
Thank you very much! I use Ensembl's GTF files for mapping, so I have it already on my disk. Didn't realise it contained the exact information I need.
Also, didn't realise there are untranslated exons. This is new for me.