Building BSgenome package fails - .make_BSgenome_seqinfo error: sequence names found in .2bit file are not identical to 'seqnames'
1
0
Entering edit mode
geo.vogler ▴ 10
@geovogler-20208
Last seen 17 months ago
United States

Hi!

I am trying to forge a BSgenome package for the most current Dmel6 genome using the FASTA file from Ensembl (BDGP6.32 v104). However, the package will not install and stops with an error:

* installing *source* package ‘BSgenome.Dmelanogaster.6.32’ ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
Warning messages:
1: package ‘BiocGenerics’ was built under R version 4.0.5 
2: package ‘GenomeInfoDb’ was built under R version 4.0.5 
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
Warning: package ‘BiocGenerics’ was built under R version 4.0.5
Warning: package ‘GenomeInfoDb’ was built under R version 4.0.5
Error: package or namespace load failed for ‘BSgenome.Dmelanogaster.6.32’:
 .onLoad failed in loadNamespace() for 'BSgenome.Dmelanogaster.6.32', details:
  call: .make_BSgenome_seqinfo(single_sequences, circ_seqs, genome, seqnames)
  error: sequence names found in file '/Users/Geo/BSgenome.Dmelanogaster.6.32.Rcheck/00LOCK-BSgenome.Dmelanogaster.6.32/00new/BSgenome.Dmelanogaster.6.32/extdata/single_sequences.2bit' are not identical to 'seqnames'. May be the data on disk is corrupted?
Error: loading failed
Execution halted
ERROR: loading failed

I prepared the genome FASTA and created the 2bit file as follows:

library(Biostrings)

dna <- readDNAStringSet("Drosophila_melanogaster.BDGP6.32.dna.toplevel.fa.gz")

### Check seqnames.
current_SequenceName <- unlist(heads(strsplit(names(dna), " ", fixed=TRUE), n=1L))

names(dna) <- current_SequenceName

### Export as 2bit.
library(rtracklayer)
export.2bit(dna, "Dmel632.2bit")

Using the original sequence names or shortened names appeared not to make a difference. Any advice would be greatly appreciated!

Geo

BSGenome Forge • 1.2k views
ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 6 hours ago
Seattle, WA, United States

Hi,

Any reason you can't use the BSgenome.Dmelanogaster.UCSC.dm6 package?

What Ensembl calls the "BDGP6.32" assembly has GenBank accession GCA_000001215.4 and RefSeq accession GCF_000001215.4. This is what NCBI calls the "Release 6 plus ISO1 MT" assembly and what UCSC calls the "dm6" genome. The only difference between BDGP6.32 and dm6 are the sequence names. BDGP6.32 and Release 6 plus ISO1 MT use the same sequence names a.k.a. the NCBI sequence names.

Switching from UCSC naming style to NCBI naming style is easy:

library(BSgenome.Dmelanogaster.UCSC.dm6)
seqinfo(BSgenome.Dmelanogaster.UCSC.dm6)
# Seqinfo object with 1870 sequences (1 circular) from dm6 genome:
#  seqnames         seqlengths isCircular genome
#  chr2L              23513712      FALSE    dm6
#  chr2R              25286936      FALSE    dm6
#  chr3L              28110227      FALSE    dm6
#  chr3R              32079331      FALSE    dm6
#  chr4                1348131      FALSE    dm6
#  ...                     ...        ...    ...
#  chrUn_DS485998v1       1003      FALSE    dm6
#  chrUn_DS486002v1       1001      FALSE    dm6
#  chrUn_DS486004v1       1001      FALSE    dm6
#  chrUn_DS486005v1       1001      FALSE    dm6
#  chrUn_DS486008v1       1001      FALSE    dm6

seqlevelsStyle(BSgenome.Dmelanogaster.UCSC.dm6) <- "NCBI"
seqinfo(BSgenome.Dmelanogaster.UCSC.dm6)
# Seqinfo object with 1870 sequences (1 circular) from Release 6 plus ISO1 MT genome:
#   seqnames        seqlengths isCircular                 genome
#   2L                23513712      FALSE Release 6 plus ISO1 MT
#   2R                25286936      FALSE Release 6 plus ISO1 MT
#   3L                28110227      FALSE Release 6 plus ISO1 MT
#   3R                32079331      FALSE Release 6 plus ISO1 MT
#   4                  1348131      FALSE Release 6 plus ISO1 MT
#   ...                    ...        ...                    ...
#   211000022278074       1003      FALSE Release 6 plus ISO1 MT
#   211000022278179       1001      FALSE Release 6 plus ISO1 MT
#   211000022279016       1001      FALSE Release 6 plus ISO1 MT
#   211000022278576       1001      FALSE Release 6 plus ISO1 MT
#   211000022278384       1001      FALSE Release 6 plus ISO1 MT

FWIW here is how you can check that BDGP6.32 uses the NCBI sequence names:

library(GenomeInfoDb)
chrominfo <- getChromInfoFromEnsembl("BDGP6.32", map.NCBI=TRUE)
identical(chrominfo$name, chrominfo$NCBI.SequenceName)  # TRUE

Hope this helps,

H.

ADD COMMENT
0
Entering edit mode

Thanks! Took me a while to figure out that all BDGP6.nn releases are dm6-compatible. My errors in my pipeline were due to something else, not the release versions. Oh, well. Your examples will be very helpful in the future.

I was just puzzled that I could forge e.g. a Drosophila pseudoobscura version for BSgenome (dp6, which was not available then), but not create dm6 anew.

ADD REPLY

Login before adding your answer.

Traffic: 760 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6