Hi!
I am trying to use R packages that require a BSgenome object to specify the genome. My datasets are from mouse, aligned against mm10 from Ensembl rather than UCSC. So, am I supposed to forge the BSgenome package myself?
Thanks in advance!
Kylie
What is "mm10 from ensembl"? AFAIK there's only _one_ mm10 assembly, and it's from UCSC: https://genome.ucsc.edu/cgi-bin/hgGateway?db=mm10
Do NOT believe the statement on the UCSC page at the above link that mm10 is based on the GRCm38 assembly from the Genome Reference Consortium. That was the case for years (since the beginning of the mm10 genome), but the UCSC folks updated mm10 in June 2021, so it's now based on the GRCm38.p6 assembly. However, they never bothered to update what's displayed at https://genome.ucsc.edu/cgi-bin/hgGateway?db=mm10
Note that you can use registered_UCSC_genomes() from the GenomeInfoDb package to see the correspondence between UCSC genomes and NCBI assemblies:
> library(GenomeInfoDb)
> registered_UCSC_genomes("musculus")
      organism genome NCBI_assembly assembly_accession with_Ensembl circ_seqs
1 Mus musculus    mm8       MGSCv36   GCF_000001635.15        FALSE      chrM
2 Mus musculus    mm9       MGSCv37   GCF_000001635.18        FALSE      chrM
3 Mus musculus   mm10     GRCm38.p6   GCF_000001635.26         TRUE      chrM
4 Mus musculus   mm39        GRCm39    GCA_000001635.9         TRUE      chrM
Anyways, mm10 is already available as a BSgenome package, so you don't need to forge your own package for this assembly:
> library(BSgenome)
> grep("musculus", available.genomes(), value=TRUE)
[1] "BSgenome.Mmusculus.UCSC.mm10" "BSgenome.Mmusculus.UCSC.mm10.masked"
[3] "BSgenome.Mmusculus.UCSC.mm39" "BSgenome.Mmusculus.UCSC.mm8"
[5] "BSgenome.Mmusculus.UCSC.mm8.masked" "BSgenome.Mmusculus.UCSC.mm9"
[7] "BSgenome.Mmusculus.UCSC.mm9.masked"
Cheers,
H.
Hi Hervé,
Thanks for your reply!
Sorry for using the wrong term. The reference genome I am using is GRCm38 from Ensembl. So, I have to forge it, right?
Thanks!
Kylie
GRCm38.p6 is a patched version of GRCm38 that only _adds_ new sequences to it. I encourage you to spend some time reading about how the Genome Reference Consortium manages assembly releases/versions/names/patches. This goes beyond Bioconductor and is general knowledge useful to any computational biologist.
So if you've aligned your data against GRCm38, then you should be able to use GRCm38.p6 (a.k.a. mm10) for your downstream analysis. Note that the opposite wouldn't work in general because of the risk that a small subset of your data got aligned to sequences in GRCm38.p6 that are not in GRCm38.
Finally note that even though the sequences in GRCm38.p6 and mm10 are the same, their names differ (the UCSC folks love to rename sequences). But you can easily switch between the UCSC names and the original names with seqlevelsStyle().
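A minimal sketch of that renaming step, assuming the aligned data has been loaded into a GRanges object. The ranges and the object name gr are made up for illustration; seqlevelsStyle() and seqlevels() come from GenomeInfoDb, which is attached along with GenomicRanges.

```r
library(GenomicRanges)  # also attaches GenomeInfoDb

# Hypothetical ranges using Ensembl/NCBI-style sequence names ("1", "X", "MT"),
# standing in for data aligned against GRCm38 from Ensembl.
gr <- GRanges(
  seqnames = c("1", "X", "MT"),
  ranges   = IRanges(start = c(100, 200, 300), width = 50)
)

# Inspect which naming style(s) the current sequence names are compatible with.
seqlevelsStyle(gr)

# Switch to UCSC-style names so they match the names used by a UCSC-based
# genome such as BSgenome.Mmusculus.UCSC.mm10.
seqlevelsStyle(gr) <- "UCSC"
seqlevels(gr)
```

The same replacement function can be used in the other direction, setting the style to "NCBI" on an object that uses UCSC names. Which side to rename depends on what the downstream package expects.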
Best,
H.
Further to what Hervé told you, the difference between Ensembl and UCSC/NCBI is primarily where the genes/transcripts/exons are in the genome, and how many of each a given gene might have. Here is a random gene we can use as an example.
So UCSC says there are 13 transcripts, and Ensembl says there are 14. Let's look at the first transcript from each.
That's supposed to be the same exact transcript, but the genomic positions are different. In fact, none of the transcripts from Ensembl overlap any of the transcripts from UCSC/NCBI! Here is a plot of UCSC (above) and Ensembl (below).