Question

BSgenome for ensembl mm10

0

Entering edit mode

kyliecode • 0

@kyliecode-14088

Last seen 5 months ago

Sweden

Hi!

I am trying to use R packages which require BSgenome to specify the genome. My datasets are from mouse aligned aginst mm10 from ensembl rather than UCSC. So, am I supposed to forge the BSgenome myself?

Thanks in advance!

Kylie

BSgenomeForge BSgenome • 1.9k views

ADD COMMENT • link updated 18 months ago by James W. MacDonald 67k • written 18 months ago by kyliecode • 0

score 0 · Answer 1 · 2023-06-12

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 1 day ago

Seattle, WA, United States

What is "mm10 from ensembl"? AFAIK there's only _one_ mm10 assembly, and it's from UCSC: https://genome.ucsc.edu/cgi-bin/hgGateway?db=mm10

Do NOT believe what is reported on the UCSC page at the above link that mm10 is based on the GRCm38 assembly from the Genome Reference Consortium. Even though this has been the case for years (since the beginning of the mm10 genome), the UCSC folks updated mm10 in June 2021, so now it's based on the GRCm38.p6 assembly. However they never bothered to update what's displayed at https://genome.ucsc.edu/cgi-bin/hgGateway?db=mm10

Note that you can use registered_UCSC_genomes() from the GenomeInfoDb package to see the correspondance between UCSC genomes and NCBI assemblies:

> library(GenomeInfoDb)

> registered_UCSC_genomes("musculus")
      organism genome NCBI_assembly assembly_accession with_Ensembl circ_seqs
1 Mus musculus    mm8       MGSCv36   GCF_000001635.15        FALSE      chrM
2 Mus musculus    mm9       MGSCv37   GCF_000001635.18        FALSE      chrM
3 Mus musculus   mm10     GRCm38.p6   GCF_000001635.26         TRUE      chrM
4 Mus musculus   mm39        GRCm39    GCA_000001635.9         TRUE      chrM

Anyways, mm10 is already available as a BSgenome package so you don't need to forge your own package for this assembly:

> library(BSgenome)

> grep("musculus", available.genomes(), value=TRUE)

[1] "BSgenome.Mmusculus.UCSC.mm10"        "BSgenome.Mmusculus.UCSC.mm10.masked"
[3] "BSgenome.Mmusculus.UCSC.mm39"        "BSgenome.Mmusculus.UCSC.mm8"        
[5] "BSgenome.Mmusculus.UCSC.mm8.masked"  "BSgenome.Mmusculus.UCSC.mm9"        
[7] "BSgenome.Mmusculus.UCSC.mm9.masked"

Cheers,

H.

ADD COMMENT • link 18 months ago Hervé Pagès 16k

0

Entering edit mode

Hi Hervé,

Thanks for your reply!

Sorry for the wrong term I have used. The reference genome I am using is GRCm38 from ensembl. So, I have to forge it, right?

Thanks!

Kylie

ADD REPLY • link 18 months ago kyliecode • 0

0

Entering edit mode

GRCm38.p6 is a patched version of GRCm38 that only _adds_ new sequences to it. I encourage you to spend some time reading about how the Genome Reference Consortium manages assembly releases/versions/names/patches. This goes beyond Bioconductor and is general knowledge useful to any computational biologist.

So if you've aligned your data against GRCm38, then you should be able to use GRCm38.p6 (a.k.a. mm10) for your downstream analysis. Note that the opposite wouldn't work in general because of the risk that a small subset of your data got aligned to sequences in GRCm38.p6 that are not in GRCm38.

Finally note that even though the sequences in GRCm38.p6 and mm10 are the same, their names differ (the UCSC folks love to rename sequences). But you can easily switch between the UCSC names and the original names with seqlevelsStyle():

library(BSgenome.Mmusculus.UCSC.mm10)

seqinfo(BSgenome.Mmusculus.UCSC.mm10)
# Seqinfo object with 239 sequences (1 circular) from mm10 genome:
#   seqnames           seqlengths isCircular genome
#   chr1                195471971      FALSE   mm10
#   chr2                182113224      FALSE   mm10
#   chr3                160039680      FALSE   mm10
#   chr4                156508116      FALSE   mm10
#   chr5                151834684      FALSE   mm10
#  ...                       ...        ...    ...
#   chr11_KZ289077_alt     186144      FALSE   mm10
#   chr11_KZ289078_alt     390920      FALSE   mm10
#   chr11_KZ289079_alt     368967      FALSE   mm10
#   chr11_KZ289080_alt     394982      FALSE   mm10
#   chr11_KZ289081_alt     369973      FALSE   mm10

seqlevelsStyle(BSgenome.Mmusculus.UCSC.mm10)
# [1] "UCSC"

seqlevelsStyle(BSgenome.Mmusculus.UCSC.mm10) <- "NCBI"

seqinfo(BSgenome.Mmusculus.UCSC.mm10)
# Seqinfo object with 239 sequences (1 circular) from GRCm38.p6 genome:
#   seqnames             seqlengths isCircular    genome
#   1                     195471971      FALSE GRCm38.p6
#   2                     182113224      FALSE GRCm38.p6
#   3                     160039680      FALSE GRCm38.p6
#   4                     156508116      FALSE GRCm38.p6
#   5                     151834684      FALSE GRCm38.p6
#   ...                         ...        ...       ...
#   PWK/PHJ_MMCHR11_CTG1     186144      FALSE GRCm38.p6
#   PWK/PHJ_MMCHR11_CTG2     390920      FALSE GRCm38.p6
#   PWK/PHJ_MMCHR11_CTG3     368967      FALSE GRCm38.p6
#   CAST/EI_MMCHR11_CTG4     394982      FALSE GRCm38.p6
#   CAST/EI_MMCHR11_CTG5     369973      FALSE GRCm38.p6

Best,

H.

ADD REPLY • link 18 months ago Hervé Pagès 16k

0

Entering edit mode

Further to what Hervé told you, the difference between Ensembl and UCSC/NCBI is primarily where the genes/transcripts/exons are in the genome, and how many of each a given gene might have. Here is a random gene we can use as an example.

> library(Mus.musculus)
> library(AnnotationHub)
> hub <- AnnotationHub()
snapshotDate(): 2023-04-06
## get the latest EnsDb for mouse
> ensdb <- hub[["AH109655"]]
loading from cache

## Get transcript positions for Trpkb based on UCSC/NCBI

> transcriptsBy(Mus.musculus, column = "SYMBOL")["69786"]
'select()' returned 1:1 mapping between keys and columns
GRangesList object of length 1:
$`69786`
GRanges object with 13 ranges and 2 metadata columns:
       seqnames            ranges strand |               tx_name
          <Rle>         <IRanges>  <Rle> |           <character>
   [1]     chr6 85911865-85915225      + |  ENSMUST00000202142.3
   [2]     chr6 85911877-85928933      + |  ENSMUST00000200680.3
   [3]     chr6 85911891-85915611      + |  ENSMUST00000201289.3
   [4]     chr6 85911907-85924377      + |  ENSMUST00000201217.3
   [5]     chr6 85915719-85927608      + |  ENSMUST00000113751.7
   ...      ...               ...    ... .                   ...
   [9]     chr6 85915747-85930282      + | ENSMUST00000067137.13
  [10]     chr6 85915755-85929025      + |  ENSMUST00000113752.7
  [11]     chr6 85915756-85924616      + |  ENSMUST00000150249.7
  [12]     chr6 85915757-85927891      + |  ENSMUST00000201939.3
  [13]     chr6 85915798-85927940      + |  ENSMUST00000149026.4
                SYMBOL
       <CharacterList>
   [1]           Tprkb
   [2]           Tprkb
   [3]           Tprkb
   [4]           Tprkb
   [5]           Tprkb
   ...             ...
   [9]           Tprkb
  [10]           Tprkb
  [11]           Tprkb
  [12]           Tprkb
  [13]           Tprkb
  -------
  seqinfo: 66 sequences (1 circular) from mm10 genome

## Now the Ensembl positions

> transcriptsBy(ensdb)["ENSMUSG00000054226"]
GRangesList object of length 1:
$ENSMUSG00000054226
GRanges object with 14 ranges and 11 metadata columns:
       seqnames            ranges strand |              tx_id
          <Rle>         <IRanges>  <Rle> |        <character>
   [1]        6 85888847-85892207      + | ENSMUST00000202142
   [2]        6 85888859-85905915      + | ENSMUST00000200680
   [3]        6 85888873-85892593      + | ENSMUST00000201289
   [4]        6 85888889-85901359      + | ENSMUST00000201217
   [5]        6 85891064-85893027      + | ENSMUST00000202588
   ...      ...               ...    ... .                ...
  [10]        6 85892729-85907264      + | ENSMUST00000067137
  [11]        6 85892737-85906007      + | ENSMUST00000113752
  [12]        6 85892738-85901598      + | ENSMUST00000150249
  [13]        6 85892739-85904873      + | ENSMUST00000201939
  [14]        6 85892780-85904922      + | ENSMUST00000149026
                   tx_biotype tx_cds_seq_start tx_cds_seq_end
                  <character>        <integer>      <integer>
   [1] protein_coding_CDS_n..             <NA>           <NA>
   [2]         protein_coding         85901364       85905844
   [3] protein_coding_CDS_n..             <NA>           <NA>
   [4] protein_coding_CDS_n..             <NA>           <NA>
   [5]        retained_intron             <NA>           <NA>
   ...                    ...              ...            ...
  [10]         protein_coding         85901364       85905844
  [11]         protein_coding         85901364       85905844
  [12] protein_coding_CDS_n..             <NA>           <NA>
  [13]         protein_coding         85901364       85904873
  [14]         protein_coding         85901364       85904922
                  gene_id tx_support_level         tx_id_version gc_content
              <character>        <integer>           <character>  <numeric>
   [1] ENSMUSG00000054226                3  ENSMUST00000202142.4    49.2537
   [2] ENSMUSG00000054226                3  ENSMUST00000200680.4    43.7346
   [3] ENSMUSG00000054226                3  ENSMUST00000201289.4    50.7853
   [4] ENSMUSG00000054226                5  ENSMUST00000201217.4    50.9847
   [5] ENSMUSG00000054226             <NA>  ENSMUST00000202588.2    49.7454
   ...                ...              ...                   ...        ...
  [10] ENSMUSG00000054226                1 ENSMUST00000067137.14    46.0295
  [11] ENSMUSG00000054226                5  ENSMUST00000113752.8    43.6582
  [12] ENSMUSG00000054226                3  ENSMUST00000150249.8    42.7822
  [13] ENSMUSG00000054226                3  ENSMUST00000201939.4    43.5789
  [14] ENSMUSG00000054226                3  ENSMUST00000149026.5    54.4010
       tx_external_name tx_is_canonical            tx_name
            <character>       <integer>        <character>
   [1]        Tprkb-212               0 ENSMUST00000202142
   [2]        Tprkb-208               0 ENSMUST00000200680
   [3]        Tprkb-210               0 ENSMUST00000201289
   [4]        Tprkb-209               0 ENSMUST00000201217
   [5]        Tprkb-213               0 ENSMUST00000202588
   ...              ...             ...                ...
  [10]        Tprkb-201               0 ENSMUST00000067137
  [11]        Tprkb-204               0 ENSMUST00000113752
  [12]        Tprkb-207               0 ENSMUST00000150249
  [13]        Tprkb-211               0 ENSMUST00000201939
  [14]        Tprkb-206               0 ENSMUST00000149026
  -------
  seqinfo: 39 sequences (1 circular) from GRCm39 genome

So UCSC says there are 13 transcripts, and Ensembl says there are 14. Let's look at the first transcript from each.

## UCSC

GRanges object with 13 ranges and 2 metadata columns:
       seqnames            ranges strand |               tx_name
          <Rle>         <IRanges>  <Rle> |           <character>
   [1]     chr6 85911865-85915225      + |  ENSMUST00000202142.3

## Ensembl

GRanges object with 14 ranges and 11 metadata columns:
       seqnames            ranges strand |              tx_id
          <Rle>         <IRanges>  <Rle> |        <character>
   [1]        6 85888847-85892207      + | ENSMUST00000202142

That's supposed to be the same exact transcript, but the genomic positions are different. In fact, none of the transcripts from Ensembl overlap any of the transcripts from UCSC/NCBI! Here is a plot of UCSC (above) and Ensembl (below).

Tprkb

ADD REPLY • link 18 months ago James W. MacDonald 67k