Question

BSgenomeForge error in .TwoBits_export

0

Entering edit mode

stu111538 • 0

@stu111538-13994

Last seen 4.6 years ago

Germany, Kiel, University Hospital Kiel

Hello,

Using "forgeBSgenomeDataPkg" I get an error I cannot solve. The problem seems to be to create the twobit file. I have write permission in the destdir and enough disc space on my drive. The sequence data are located in a directory mounted from a linux-server, but I think this is not the problem, since the fasta files were loaded.

This is my seed-file:

Package: BSgenome.Hsapiens.NCBI.GRCh38.p11
Title: Homo Sapiens full genome for RRBS (NCBI version GRCh38.p11)
Description: Homo Sapiens full genome as provided by NCBI
organism: Homo sapiens
common_name: Human
provider: NCBI
provider_version: GRCh38.p11
release_date: 14/06/17
release_name: GRCh38.p11
source_url: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.37_GRCh38.p11/
organism_biocview: Homo_sapiens
BSgenomeObjname: Hsapiens
seqnames: c("NT_187687.1","NT_113949.2","NC_012920.1","RRBS_spike-in_SQ6hmC","RRBS_spike-in_SQ1hmC","RRBS_spike-in_SQC","RRBS_spike-in_SQmC","RRBS_spike-in_SQfC","RRBS_spike-in_DC_100","RRBS_spike-in_CC_100")
circ_seqs: "NC_012920.1"
seqs_srcdir: /home/lmoebus/mountrz_work_zfs/references/BSgenome_R_package/seqs_srcdir
seqfiles_suffix: .fa

This is the command I was running:

>forgeBSgenomeDataPkg("/home/lmoebus/mountrz_work_zfs/references/BSgenome_R_package/DESTCRIPTION", destdir = "/home/lmoebus/mountrz_work_zfs/references/BSgenome_R_package/", verbose = TRUE)

This is the end of the stdout I get:

...

Loading 'RRBS_spike-in_DC_100' sequence from FASTA file '/home/lmoebus/mountrz_work_zfs/references/BSgenome_R_package/seqs_srcdir/RRBS_spike-in_DC_100.fa' ... DONE
Loading 'RRBS_spike-in_CC_100' sequence from FASTA file '/home/lmoebus/mountrz_work_zfs/references/BSgenome_R_package/seqs_srcdir/RRBS_spike-in_CC_100.fa' ... DONE
Writing all sequences to '/home/lmoebus/mountrz_work_zfs/references/BSgenome_R_package//BSgenome.Hsapiens.NCBI.GRCh38.p11/inst/extdata/single_sequences.2bit' ...
error in .TwoBits_export(mapply(.DNAString_to_twoBit, object, seqnames), :
UCSC library operation failed
Zusätzlich: Es gab 17 Warnungen (Anzeige mit warnings()) [translation: additionally there were 17 warnings]

The created output directory only contains those four elements: DESCRIPTION and NAMESPACE file, R/ and man/ directory. There is no genome created.

> traceback()

15: .Call(TwoBits_write, object, con)
14: .TwoBits_export(mapply(.DNAString_to_twoBit, object, seqnames),
twoBitPath(path(con)))
13: .local(object, con, format, ...)
12: export(object, FileForFormat(con, format), ...)
11: export(object, FileForFormat(con, format), ...)
10: export(seqs, dest_filepath, format = "2bit")
9: export(seqs, dest_filepath, format = "2bit")
8: .forgeTwobitFileFromFastaFiles(seqnames, prefix, suffix, seqs_srcdir,
seqs_destdir, verbose = verbose)

7: forgeSeqFiles(.seqnames, mseqnames = .mseqnames, seqfile_name = x@seqfile_name,

prefix = x@seqfiles_prefix, suffix = x@seqfiles_suffix, seqs_srcdir = seqs_srcdir,

seqs_destdir = seqs_destdir, ondisk_seq_format = x@ondisk_seq_format,

verbose = verbose)

6: forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir,

verbose = verbose)

5: forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir,

verbose = verbose)

4: forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir,

verbose = verbose)

3: forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir,

verbose = verbose)

2: forgeBSgenomeDataPkg("/home/lmoebus/mountrz_work_zfs/references/BSgenome_R_package/DESTCRIPTION",

destdir = "/home/lmoebus/mountrz_work_zfs/references/BSgenome_R_package/",

verbose = TRUE)

1: forgeBSgenomeDataPkg("/home/lmoebus/mountrz_work_zfs/references/BSgenome_R_package/DESTCRIPTION",

destdir = "/home/lmoebus/mountrz_work_zfs/references/BSgenome_R_package/",

verbose = TRUE)

> sessionInfo()

R version 3.4.1 (2017-06-30)

Platform: x86_64-pc-linux-gnu (64-bit)

Running under: Debian GNU/Linux 8 (jessie)

Matrix products: default

BLAS: /usr/lib/libblas/libblas.so.3.0

LAPACK: /usr/lib/lapack/liblapack.so.3.0

locale:

[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 LC_MONETARY=de_DE.UTF-8

[6] LC_MESSAGES=de_DE.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C

[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages:

[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:

[1] BSgenome_1.44.1 rtracklayer_1.36.4 Biostrings_2.44.2 XVector_0.16.0 GenomicRanges_1.28.5 GenomeInfoDb_1.12.2 IRanges_2.10.3

[8] S4Vectors_0.14.4 BiocGenerics_0.22.0

loaded via a namespace (and not attached):

[1] lattice_0.20-35 matrixStats_0.52.2 XML_3.98-1.9 Rsamtools_1.28.0 GenomicAlignments_1.12.2

[6] bitops_1.0-6 grid_3.4.1 zlibbioc_1.22.0 Matrix_1.2-10 BiocParallel_1.10.1

[11] tools_3.4.1 Biobase_2.36.2 RCurl_1.95-4.8 DelayedArray_0.2.7 compiler_3.4.1

[16] SummarizedExperiment_1.6.4 GenomeInfoDbData_0.99.0

I appreciate every hint! Thanks in advance!

Lena

bsgenome forge TwoBits_export • 2.3k views

ADD COMMENT • link updated 7.6 years ago by Hervé Pagès 16k • written 7.6 years ago by stu111538 • 0

0

Entering edit mode

Hi Lena,

Not sure what's going on. But I wonder what these RRBS_spike-in_* sequences are and where you downloaded the RRBS_spike-in_*.fa files from. The source_url field you provide points to a folder that contains the GCF_000001405.37_GRCh38.p11_genomic.fna.gz file (this is a FASTA file that contains all the genomic sequences). This suggests that you downloaded this file to get the sequences but that file doesn't seem to contain the RRBS_spike-in_* sequences. Puzzling! Furthermore, your seqnames field only contains 10 sequences but the GRCh38.p11 assembly has 578 sequences! See:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.37_GRCh38.p11/GCF_000001405.37_GRCh38.p11_assembly_report.txt

Note that the official assembly report for GRCh38.p11 doesn't contain any RRBS_spike-in_* sequences either.

All this is somewhat confusing and suggests that you might actually be trying to do something quite different than forging a BSgenome package for GRCh38.p11. It would help if you could clarify what you are exactly trying to do.

FWIW note that if, instead of using one FASTA file per sequence (like you seem to be doing), you use a single FASTA file that contains all the sequences (e.g. GCF_000001405.37_GRCh38.p11_genomic.fna.gz), then you don't need to specify the seqnames field. Note however that GCF_000001405.37_GRCh38.p11_genomic.fna.gz cannot be used as-is and requires the following massage before it can be used to forge the BSgenome package:

library(Biostrings)
fasta_file <- "GCF_000001405.37_GRCh38.p11_genomic.fna.gz"
genomic_dna <- readDNAStringSet(fasta_file)

## Fix the names.
names(genomic_dna) <- sub(" .*$", "", names(genomic_dna))
library(GenomeInfoDb)
assembly_report <- GenomeInfoDb:::fetch_assembly_report("GCA_000001405.26")
m <- match(assembly_report[ , "RefSeqAccn"], names(genomic_dna))
stopifnot(!anyNA(m))
genomic_dna <- genomic_dna[m]
names(genomic_dna) <- assembly_report[ , "SequenceName"]

## Replace non A/C/G/T/N letters with Ns.
patterns <- DNAStringSet(setdiff(names(IUPAC_CODE_MAP),
                                 c("A", "C", "G", "T", "N")))
for (seqname in names(genomic_dna)) {
    cat("replacing non A/C/G/T/N letters with Ns in", seqname, "... ")
    subject <- genomic_dna[[seqname]]
    mi <- matchPDict(patterns, subject)
    genomic_dna[[seqname]] <- replaceAt(subject, unlist(mi), "N")
    cat("OK\n")
}

## Write the 2bit file.
export.2bit(genomic_dna, con="GRCh38.p11.2bit")

This will take a few minutes. Then all you need to do is move the GRCh38.p11.2bit file to /home/lmoebus/mountrz_work_zfs/references/BSgenome_R_package/seqs_srcdir/ and replace the seqnames and seqfiles_suffix fields of your seed file with:

seqfile_name: GRCh38.p11.2bit

Hope this helps,

H.

ADD REPLY • link 7.6 years ago Hervé Pagès 16k

score 0 · Answer 1 · 2017-09-21

Hi Herve,

sorry, for the confusion! I just added the RRBS-spike sequences for internal control of conversion rate. I also used those sequences for the alignment, that is why I thought it might be better to inlcude them in the GRCh38.p11 reference for BSgenome (I wanted to prevent error messages because of reference sequences in the bamfiles which are not present in the BSgenome). And, I did not list all the GRCh38.p11 sequences for seqnames, since I had to shorten my question here, otherwise it would be to long for this support site.

I tried your suggestion with the "GCF_000001405.37_GRCh38.p11_genomic.fna.gz" file and the GRCh38.p11.2bit file was created properly as I can say.

However, to forge the BSgenome did not work.

> forgeBSgenomeDataPkg("/home/lmoebus/mountrz_work_zfs/references/BSgenome_R_package/DESTCRIPTION", verbose = TRUE)
Creating package in ./BSgenome.Hsapiens.NCBI.GRCh38.p11
Copying '/home/lmoebus/mountrz_work_zfs/references/BSgenome_R_package/seqs_srcdir/GRCh38.p11.2bit' to './BSgenome.Hsapiens.NCBI.GRCh38.p11/inst/extdata/single_sequences.2bit' ...
error in .copySeqFile(seqfile_name, seqs_srcdir, seqs_destdir, verbose = verbose) :
FAILED
Zusätzlich: Warnmeldung: [in addition: warning]
In file.create(to[okay]) :
kann Datei './BSgenome.Hsapiens.NCBI.GRCh38.p11/inst/extdata/single_sequences.2bit' nicht erzeugen. Grund 'Datei oder Verzeichnis nicht gefunden' [cannot create './BSgenome.Hsapiens.NCBI.GRCh38.p11/inst/extdata/single_sequences.2bit', directory not found]

The BSgenome.Hsapiens.NCBI.GRCh38.p11 destdir was created, but the twobit file cannot be copied to it. I have write permissions.

> traceback()
9: stop("FAILED")
8: .copySeqFile(seqfile_name, seqs_srcdir, seqs_destdir, verbose = verbose)
7: forgeSeqFiles(.seqnames, mseqnames = .mseqnames, seqfile_name = x@seqfile_name,
       prefix = x@seqfiles_prefix, suffix = x@seqfiles_suffix, seqs_srcdir = seqs_srcdir,
       seqs_destdir = seqs_destdir, ondisk_seq_format = x@ondisk_seq_format,
       verbose = verbose)
6: forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir,
       verbose = verbose)
5: forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir,
       verbose = verbose)
4: forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir,
       verbose = verbose)
3: forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir,
       verbose = verbose)
2: forgeBSgenomeDataPkg("/home/lmoebus/mountrz_work_zfs/references/BSgenome_R_package/DESTCRIPTION",
       verbose = TRUE)
1: forgeBSgenomeDataPkg("/home/lmoebus/mountrz_work_zfs/references/BSgenome_R_package/DESTCRIPTION",
       verbose = TRUE)

my seed-file:

Package: BSgenome.Hsapiens.NCBI.GRCh38.p11
Title: Homo Sapiens full genome (NCBI version GRCh38.p11)
Description: Homo Sapiens full genome as provided by NCBI
Version: 1.0
Author: Lena Möbus
organism: Homo sapiens
common_name: Human
provider: NCBI
provider_version: GRCh38.p11
release_date: 14/06/17
release_name: GRCh38.p11
source_url: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.37_GRCh38.p11/
organism_biocview: Homo_sapiens
BSgenomeObjname: Hsapiens
circ_seqs: "NC_012920.1"
PkgDetails: "This package was self-made with the forgeBSgenomeDataPkg function by Lena Möbus since the reference genome GRCh38.p11 which was used for mapping the MZ Twin RRBS data was not available in BSgenomes. BSgenome was neccessary for MethyPipe R package"
seqs_srcdir: /home/lmoebus/mountrz_work_zfs/references/BSgenome_R_package/seqs_srcdir
seqfile_name: GRCh38.p11.2bit

Thank you very much for your help!

score 0 · Answer 2 · 2017-09-21

In the package destdir 'BSgenome.Hsapiens.NCBI.GRCh38.p11' there is a file called R/zzz.R. This is its content, maybe it helps

###
###

.pkgname <- "BSgenome.Hsapiens.NCBI.GRCh38.p11"

.seqnames <- NULL

.circ_seqs <- "NC_012920.1"

.mseqnames <- NULL

.onLoad <- function(libname, pkgname)
{
    if (pkgname != .pkgname)
        stop("package name (", pkgname, ") is not ",
             "the expected name (", .pkgname, ")")
    extdata_dirpath <- system.file("extdata", package=pkgname,
                                   lib.loc=libname, mustWork=TRUE)

    ## Make and export BSgenome object.
    bsgenome <- BSgenome(
        organism="Homo sapiens",
        common_name="Human",
        provider="NCBI",
        provider_version="GRCh38.p11",
        release_date="14/06/17",
        release_name="GRCh38.p11",
        source_url="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.37_GRCh38.p11/",
        seqnames=.seqnames,
        circ_seqs=.circ_seqs,
        mseqnames=.mseqnames,
        seqs_pkgname=pkgname,
        seqs_dirpath=extdata_dirpath
    )

ns <- asNamespace(pkgname)

    objname <- pkgname
    assign(objname, bsgenome, envir=ns)
    namespaceExport(ns, objname)

    old_objname <- "Hsapiens"
    assign(old_objname, bsgenome, envir=ns)
    namespaceExport(ns, old_objname)
}

score 0 · Answer 3 · 2017-09-22

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 13 hours ago

Seattle, WA, United States

I see. It looks like the BSgenome.Hsapiens.NCBI.GRCh38.p11 destdir got created but not the inst/extdata/ folder below it.

This should be fixed in BSgenome 1.44.2. The fix should take between 24-48 hours to become available via biocLite().

Sorry for the inconvenience,

H.

ADD COMMENT • link 7.6 years ago Hervé Pagès 16k

0

Entering edit mode

ok, thank you very much! Then, I will wait until Monday and try to install the fixed BSgenome.

ADD REPLY • link 7.6 years ago stu111538 • 0

0

Entering edit mode

It seems the same error is showing in the recent version 1.52.0

ADD REPLY • link 5.7 years ago bioinagesh • 0

0

Entering edit mode

Please open a new issue and describe what happens to you exactly. Also provide your sessionInfo(). Thanks!

ADD REPLY • link 5.7 years ago Hervé Pagès 16k