I have a (large) bgzipped VCF file. I would like to read through it by chunks of records (say, 1000), doing something on these, and saving them back to another bgzipped VCF file.
Following the examples at the bottom of ?TabixFile
and ?readVCF
, here is what I'm doing for the moment:
library(VariantAnnotation) tabix.file <- TabixFile(file="input.vcf.gz", yieldSize=1000) open(tabix.file) con <- file("output.vcf", open="a") while(nrow(vcf <- readVcf(tabix.file, "mygenome"))){ ... # do something on the "vcf" object writeVcf(obj=vcf, filename=con) } close(con) close(tabix.file) bgzip("output.vcf") indexTabix("output.vcf.bgz", "vcf")
It doesn't seem very efficient to first append iteratively to an uncompressed file, and then to compress it. But is there a way to directly append to a bgzipped VCF file? Or maybe there is a completely different way?
bgzip
is indeed close togzip
, but it has an indexing scheme to allow random access (more here). This means that usinggzfile()
would not have the indexing scheme, unfortunately. In fact, after looking at the source code ofwriteVcf()
, it looks like there is no other way than what I wrote in my question.It looks like all you would need is a BC field in the gzip header that indicates the size of the block. Since that depends only on the block, blocks are independent, and it should be possible to concatenate bgzf streams. In theory, Rsamtools could support writing bgzf to an anonymous pipe and return the data to R as a raw vector or connection. The data could then be written in append mode. It's also conceivable that one could use gzcon to write a gzip block into a raw vector, and then stream over the gzip header in R, and insert the BC field before writing. The header can't be that long.
I've a hard time concretely understanding what you suggest. Inside the
while
loop, I have a VCF object to write, butbgzip
requires a file as input. So do you mean I should hack into theRsamtools
pkg to allowbgzip
to take a connection as input? If so, by looking at the current C code, I guess, I should add abgzip_con
function, taking a pointer to aBGZF
object as input?Maybe it would be as simple as opening the bgzf file in append mode? From looking at the code, it looks like the
bgzf_open
function in samtools does not support appending, but since Rsamtools has already modified that function for Windows support, perhaps we could make another simple change.That would do the trick, indeed.
Do you want me to ask first the SAMtools team to add support for "append"?By looking at the latest version of htslib (master branch, here), it seems thatbgzf_open
does now support "a". Thus, is it possible to update Rsamtools with htslib v1.2.1?I think that there is some movement afoot to port Rsamtools to use htslib (via Rhtslib), but that has not yet happened.