Hello,
I am new to the VariantAnnotation package. I am trying to filter a 4-species vcf with the below code. I am having 2 problems: 1) I am not recovering the one site in my test file where all 4 species have RGQ>=30 (along with with variant sites that are not filtered), and 2) the header is of the vcf file is being printed multiple times (~55 times). I will provide any further information that would be helpful.
many thanks,
Stephen
code:
library("VariantAnnotation") ref <- "ref" x <- readVcf("test.vcf.gz", genome=ref) ##prefilter ##only those that have RGQ isRGQ <- function(x) { grepl("RGQ", x, fixed=TRUE) } ##filters RGQ_filter <- function(x, value=30) { ## Filter on RGQ > value; default is 30 test <- apply(geno(x)$RGQ>=value, 1, sum)==4 out <- is.na(test) & test out[is.na(out)] <- TRUE return(out) } prefilters <- FilterRules(list(RGQ_prefilter=isRGQ)) filters <- FilterRules(list(RGQ_filter=RGQ_filter)) file.gz <- "test.vcf.gz" file.gz.tbi <- "test.vcf.gz.tbi" destination.file <- "test2.vcf.gz" size=1 tabix.file <- TabixFile(file.gz, yieldSize=size) destination.file <- tempfile() filterVcf(file=tabix.file, genome=ref, destination=destination.file, filters=filters, verbose=TRUE)
You'll want to use a yieldSize much larger than 1, say 10,000 - 100,000, so that you benefit from R's vectorized calculations.
Hello Martin,
When I used the TabixFile function, the vcf header was being written into the file multiple times. The function is running, but it is slow. Is there a way around this?
There is a report of this here: https://stat.ethz.ch/pipermail/bioc-devel/2015-August/007877.html.
I have 1.16.4 installed. Maybe a regression? Thank you for the help.
When I look at the filterVcf,character-method, I see
so actually 'under the hood' it is using TabixFile; it seems like the key is to open it before calling
filterVcf()
. But maybe it's good enough to just usefile.gz
-- is there a reason that you want to useTabixFile()
explicitly? And again, using yieldSize=1 is very inefficient.Also, I believe that you can use the param argument and ScanVcfParam() to select just the fields required for the filter
I was using yield size 1 to test. I am still experiencing the multiple header problem. Thank you for all of the help.
Relevant code snipette:
size=10
tabix.file <- open(TabixFile(file.gz, yieldSize=size))
#destination.file <- tempfile()
filterVcf(file=tabix.file, genome=ref, destination=destination.file, filters=filters, verbose=TRUE)
close(tabix.file)
Popping out to top level comment to give some more room. For a reproducible example I did this, modified from ?filterVcf
The output (abbreviate) was
which happened very quickly; there were no duplicate headers. Can you perform the same? If you experience problems, can you add the output of sessionInfo() to your post? Also, if you could provide your VCF file (or a small portion of it, on dropbox or similar, or to me at martin.morgan at roswellpark.org) then I will look into this further.
A speed-up is to replace the apply() function with rowSums()
This works fine. My script is close to working. I updated Bioconductor, and utilized your suggestions. I appreciate all of your help.