Question: Easy way to fetch the first exons from genome
1
0
Entering edit mode
XIN1988 • 0
@xin1988-22335
Last seen 4.8 years ago

Hi I am wondering which way could allow me to retrieve the first exons from genome with the information of the exon length. the purpose is to retrieve the first exons for gRNA design. Thanks, Xin

annotation • 1.7k views
ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 13 hours ago
Seattle, WA, United States

Hi,

How about extracting the exons from a TxDb object and pick up the "first exons"? (whatever that means):

library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
ex <- exons(txdb)
ex
# GRanges object with 76920 ranges and 1 metadata column:
#            seqnames        ranges strand |   exon_id
#               <Rle>     <IRanges>  <Rle> | <integer>
#       [1]     chr2L     7529-8116      + |         1
#       [2]     chr2L     8193-8589      + |         2
#       [3]     chr2L     8193-9484      + |         3
#       [4]     chr2L     8229-9484      + |         4
#       [5]     chr2L     8668-9484      + |         5
#       ...       ...           ...    ... .       ...
#   [76916]   chrYHet 327915-328204      - |     76916
#   [76917]   chrYHet 328270-328489      - |     76917
#   [76918] chrUextra 523024-523048      - |     76918
#   [76919] chrUextra 523024-523086      - |     76919
#   [76920] chrUextra 523060-523086      - |     76920
#   -------
#   seqinfo: 15 sequences (1 circular) from dm3 genome

The exons returned by exons() are guaranteed to be sorted first by chromosome, then by strand, then by ascending position on the chromosome.

H.

ADD COMMENT
0
Entering edit mode

Hi Hervé, Thank you very much. Sorry for the unclear description. Actually, I want to retrieve the first exons for each gene where RNA polymerase begins transcribing. The reason that I will use the retrieved exon for CRISPR gRNA design as targeting on these regions would give high efficient knockout.

ADD REPLY
1
Entering edit mode

A gene has one TSS per transcript. You can easily get the 1st exon of each known transcript by extracting the exons grouped by transcript:

library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
ex_by_tx <- exonsBy(txdb, by="tx")
ex_by_tx
# GRangesList object of length 29173:
# $`1`
# GRanges object with 2 ranges and 3 metadata columns:
#       seqnames    ranges strand |   exon_id   exon_name exon_rank
#          <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
#   [1]    chr2L 7529-8116      + |         1        <NA>         1
#   [2]    chr2L 8193-9484      + |         3        <NA>         2
#   -------
#   seqinfo: 15 sequences (1 circular) from dm3 genome
#
# $`2`
# GRanges object with 3 ranges and 3 metadata columns:
#       seqnames    ranges strand |   exon_id   exon_name exon_rank
#          <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
#   [1]    chr2L 7529-8116      + |         1        <NA>         1
#   [2]    chr2L 8193-8589      + |         2        <NA>         2
#   [3]    chr2L 8668-9484      + |         5        <NA>         3
#   -------
#   seqinfo: 15 sequences (1 circular) from dm3 genome
#
# ...
# <29171 more elements>

ex_by_tx is a named GRangesList object with 1 GRanges object per known transcript. The names on ex_by_tx are the transcript internal ids (these ids are internal to the TxDb object and have no meaning outside it). Each GRanges object in ex_by_tx contains the exons for the corresponding transcript. The exons are ordered by rank with respect to their transcript. See ?exonsBy for more information.

To keep the 1st exon for each transcript we can use heads(), a version of head() that works on a list-like object and keeps the first n elements within each list element (note that heads() is a Bioconductor extension and is not part of base R):

first_exons <- heads(ex_by_tx, n=1)
# GRangesList object of length 29173:
# $`1`
# GRanges object with 1 range and 3 metadata columns:
#       seqnames    ranges strand |   exon_id   exon_name exon_rank
#          <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
#   [1]    chr2L 7529-8116      + |         1        <NA>         1
#   -------
#   seqinfo: 15 sequences (1 circular) from dm3 genome
#
# $`2`
# GRanges object with 1 range and 3 metadata columns:
#       seqnames    ranges strand |   exon_id   exon_name exon_rank
#          <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
#   [1]    chr2L 7529-8116      + |         1        <NA>         1
#   -------
#   seqinfo: 15 sequences (1 circular) from dm3 genome
#
# ...
# <29171 more elements>

We can turn this into a GRanges object with unlist():

first_exons <- unlist(first_exons, use.names=FALSE)
mcols(first_exons) <- DataFrame(tx_id=as.integer(names(ex_by_tx)))

and add some useful metadata columns to it:

tx_lens <- transcriptLengths(txdb)
stopifnot(identical(mcols(first_exons)$tx_id, tx_lens$tx_id))
colnames(tx_lens)
# [1] "tx_id"   "tx_name" "gene_id" "nexon"   "tx_len" 
mcols(first_exons) <- tx_lens[ , c(2, 3, 5)]  # keeping only cols 2, 3, 5
first_exons
# GRanges object with 29173 ranges and 3 metadata columns:
#         seqnames        ranges strand|     tx_name     gene_id    tx_len
#            <Rle>     <IRanges>  <Rle>| <character> <character> <integer>
#    [1]     chr2L     7529-8116      +| FBtr0300689 FBgn0031208      1880
#    [2]     chr2L     7529-8116      +| FBtr0300690 FBgn0031208      1802
#    [3]     chr2L     7529-8116      +| FBtr0330654 FBgn0031208      1844
#    ...       ...           ...    ....         ...         ...       ...
#[29171] chrUextra 523024-523048      -| FBtr0330363 FBgn0264003        25
#[29172] chrUextra 523024-523086      -| FBtr0330361 FBgn0264003        63
#[29173] chrUextra 523060-523086      -| FBtr0330362 FBgn0264003        27
#-------
#seqinfo: 15 sequences (1 circular) from dm3 genome

This final version of first_exons contains the genomic location of the 1st exon of each known transcript and each exon is mapped to the corresponding transcript name, gene name, and transcript length.

Hope this helps,

H.

ADD REPLY
0
Entering edit mode

I should add that the fact that some genomic ranges are repeated multiple times in first_exons (e.g. the first 3 exons are the same) simply reflects the fact that several transcripts in a gene can share the same first exon. So if you're interested in knowing the location of the first exon for each gene, you can start by getting rid of the repeated exons with:

first_exons <- unique(first_exons)
## Dropping tx_name and tx_len metadata columns since they are no
## longer that meaningful:
mcols(first_exons)$tx_name <- mcols(first_exons)$tx_len <- NULL
first_exons
# GRanges object with 22490 ranges and 1 metadata column:
#            seqnames        ranges strand |     gene_id
#               <Rle>     <IRanges>  <Rle> | <character>
#       [1]     chr2L     7529-8116      + | FBgn0031208
#       [2]     chr2L   21952-22941      + | FBgn0263584
#       [3]     chr2L   66584-66612      + | FBgn0067779
#       [4]     chr2L   67043-67507      + | FBgn0067779
#       [5]     chr2L   67043-67111      + | FBgn0067779
#       ...       ...           ...    ... .         ...
#   [22486]   chrYHet 320942-320997      - | FBgn0085791
#   [22487]   chrYHet 328270-328489      - | FBgn0085792
#   [22488] chrUextra 523024-523048      - | FBgn0264003
#   [22489] chrUextra 523024-523086      - | FBgn0264003
#   [22490] chrUextra 523060-523086      - | FBgn0264003
#   -------
#   seqinfo: 15 sequences (1 circular) from dm3 genome

Even after doing this first_exons can still contain more than one "first exon" per gene (e.g. gene FBgn0067779 contains 4 "first exons"). To keep only one "first exon" per gene, I guess the natural thing to do is to go after "the most upstream first exon" for each gene, that is, the exon with the smaller start position for genes on the + strand and the exon with the greatest end position for genes on the minus strand. Here is one way to do this:

first_ex_per_gene <- split(first_exons, mcols(first_exons)$gene_id)
gene_strand <- as.character(unique(strand(first_ex_per_gene)))
idx <- ifelse(gene_strand == "+",
              which.min(start(first_ex_per_gene)),
              which.max(end(first_ex_per_gene)))
very_first_exon_in_gene <- unlist(first_ex_per_gene[as.list(idx)],
                                  use.names=FALSE)
very_first_exon_in_gene
# GRanges object with 15558 ranges and 1 metadata column:
#           seqnames            ranges strand |     gene_id
#              <Rle>         <IRanges>  <Rle> | <character>
#       [1]    chr3R   2648220-2648518      + | FBgn0000003
#       [2]    chr2R 18024494-18024531      + | FBgn0000008
#       [3]    chr3R 12654643-12655767      - | FBgn0000014
#       [4]    chr3R 12797481-12797958      - | FBgn0000015
#       [5]    chr3L 16639910-16640982      - | FBgn0000017
#       ...      ...               ...    ... .         ...
#   [15554]    chr3L 12238610-12239148      - | FBgn0264723
#   [15555]    chr3L 15327882-15327984      + | FBgn0264724
#   [15556]    chr3L 12025657-12025739      + | FBgn0264725
#   [15557]    chr3L 12020901-12021253      + | FBgn0264726
#   [15558]    chr3L 22065469-22065720      + | FBgn0264727
#   -------
#   seqinfo: 15 sequences (1 circular) from dm3 genome

H.

ADD REPLY
0
Entering edit mode

One more thing (and I'll stop replying to myself).

The very first exon in each gene can actually be obtained more simply with:

ex_by_gene <- exonsBy(txdb, by="gene")
gene_strand <- as.character(unique(strand(ex_by_gene)))
idx <- ifelse(gene_strand == "+",
              which.min(start(ex_by_gene)),
              which.max(end(ex_by_gene)))
very_first_exon_in_each_gene <- unlist(ex_by_gene[as.list(idx)],
                                       use.names=FALSE)
mcols(very_first_exon_in_each_gene) <-
    DataFrame(gene_id=names(ex_by_gene))
very_first_exon_in_each_gene
# GRanges object with 15682 ranges and 1 metadata column:
#           seqnames            ranges strand |     gene_id
#              <Rle>         <IRanges>  <Rle> | <character>
#       [1]    chr3R   2648220-2648518      + | FBgn0000003
#       [2]    chr2R 18024494-18024531      + | FBgn0000008
#       [3]    chr3R 12654643-12655767      - | FBgn0000014
#       [4]    chr3R 12797481-12797958      - | FBgn0000015
#       [5]    chr3L 16639910-16640982      - | FBgn0000017
#       ...      ...               ...    ... .         ...
#   [15678]    chr3L 12238610-12239148      - | FBgn0264723
#   [15679]    chr3L 15327882-15327984      + | FBgn0264724
#   [15680]    chr3L 12025657-12025739      + | FBgn0264725
#   [15681]    chr3L 12020901-12021253      + | FBgn0264726
#   [15682]    chr3L 22065469-22065720      + | FBgn0264727
#   -------
#   seqinfo: 15 sequences (1 circular) from dm3 genome

This method is actually doing the right thing if the goal is to extract the most upstream exon for each gene in the TxDb object. Unlike in very_first_exon_in_gene, all the known genes (i.e. all the genes in the TxDb object) are represented in very_first_exon_in_each_gene.

The previous method was loosing some genes along the way due to the fact that some genes can share the same most upstream exon and that we were calling unique() at some point. For example genes FBgn0000055 and FBgn0000056 share the same most upstream exon:

very_first_exon_in_each_gene[24:25]
# GRanges object with 2 ranges and 1 metadata column:
#       seqnames            ranges strand |     gene_id
#          <Rle>         <IRanges>  <Rle> | <character>
#   [1]    chr2L 14615555-14615643      + | FBgn0000055
#   [2]    chr2L 14615555-14615643      + | FBgn0000056
#   -------
#   seqinfo: 15 sequences (1 circular) from dm3 genome

which is why FBgn0000056 didn't make it to very_first_exon_in_gene:

"FBgn0000055" %in% mcols(very_first_exon_in_gene)$gene_id
# [1] TRUE
"FBgn0000056" %in% mcols(very_first_exon_in_gene)$gene_id
# [1] FALSE

H.

ADD REPLY
0
Entering edit mode

Thank you very much, Hervé. Really helpful!

ADD REPLY
0
Entering edit mode

Hi Hervé, I am trying to design the genome-wide gRNA libraries for several different species. One paper described a pipeline for the design. Basically, they first select the exonic guide sites fitting the pattern G(N16–19)NGG, and then annotate these guide sites as targeting Ensembl GRCh37 genes models to generate candidate guides towards each gene. which R package would you recommend for this purpose. Thanks

ADD REPLY
0
Entering edit mode

I don't know. This sounds like is a slightly different topic though so I would recommend that maybe you try to ask this as a new question.

ADD REPLY

Login before adding your answer.

Traffic: 880 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6