Attempting to use the filterVcf from the VariantAnnotation package. There's something wrong in the header of the VCF filter file, but I can't for the life of me figure out what - I have no idea what to make of the error message. And with out being able to figure out what's wrong, I can't fix whatever is generating the error upstream.
When attempting to filter a VCF file, it fails with the following message:
Error in value[[3L]](cond) :
Filter '' failed: shape of 'skeleton' is not compatible with 'NROW(flesh)'
The relevant code that generates this is:
regionFilter = function(x) grepl("exonic|splicing|UTR5|upstream", x, perl = TRUE)
infoFilter = function(x) {
blackList = info(x)$wgEncodeDacMapabilityConsensusExcludable == "." # discard variants in Encode DAC black list regions
germline = info(x)$SS == 1 # germline calls apparently.
return(unlist(blackList & germline))
}
## filters on the geno region.
genoFilter = function(x) {
Ndepth = 8 # N overall depth
N.var.depth = 3 # N ALT supporting depth
nd = (geno(x)$AD[,"NORMAL"] + geno(x)$RD[, "NORMAL"]) >= Ndepth
nvd = geno(x)$AD[,"NORMAL"] >= N.var.depth
unlist(nd & nvd)
}
## contstruct the filter lists.
PF = FilterRules(list(regionFilter))
FF = FilterRules(list( infoFilter, genoFilter ))
## Process
snpVCF = "/pequod/project/cancer-seq-pipeline/data/intermediate/test.vcf.gz"
snpFilter = filterVcf(snpVCF , "hg19", tempfile(), filters = FF, prefilters = PF )
Apart from the fact that I stripped out all but the first couple of variant calls, and ran it through with the just the header (same error), the only reference I can find to this error message is this one from a few years ago, which suggests that there's something wrong with the headers. What was wrong there, is not the problem though - my GT and DP feilds are present:
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
The only difference that I can see is the PL field where is says the number is G, rather than 0 or 1. I presume this is correct as I haven't been able to find any specification of the VCF format where G is listed as an option, but if I change it to 1, it does some sort of sanity check in the background, and throws a warning saying it should be G.
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
The header is ~230 lines long so rather than paste it here I've put a copy up on pastebin here, hopefully someone else can see what's wrong.
> > sessionInfo() R version 3.6.1 (2019-07-05) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.3 LTS
>
> Matrix products: default BLAS:
> /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3 LAPACK:
> /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
>
> locale: [1] LC_CTYPE=en_NZ.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_NZ.UTF-8 LC_COLLATE=en_NZ.UTF-8 [5]
> LC_MONETARY=en_NZ.UTF-8 LC_MESSAGES=en_NZ.UTF-8 [7]
> LC_PAPER=en_NZ.UTF-8 LC_NAME=C [9]
> LC_ADDRESS=C LC_TELEPHONE=C [11]
> LC_MEASUREMENT=en_NZ.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages: [1] stats4 parallel stats graphics
> grDevices utils datasets [8] methods base
>
> other attached packages: [1] reticulate_1.13 jsonlite_1.6
> [3] httr_1.4.1 here_0.1 [5]
> forcats_0.4.0 stringr_1.4.0 [7]
> dplyr_0.8.3 purrr_0.3.2 [9]
> readr_1.3.1 tidyr_1.0.0 [11]
> tibble_2.1.3 ggplot2_3.2.1 [13]
> tidyverse_1.2.1 VariantAnnotation_1.30.1 [15]
> Rsamtools_2.0.0 Biostrings_2.52.0 [17]
> XVector_0.24.0 SummarizedExperiment_1.14.0 [19]
> DelayedArray_0.10.0 BiocParallel_1.18.0 [21]
> matrixStats_0.55.0 Biobase_2.44.0 [23]
> GenomicRanges_1.36.0 GenomeInfoDb_1.20.0 [25]
> IRanges_2.18.1 S4Vectors_0.22.0 [27]
> BiocGenerics_0.30.0
>
> loaded via a namespace (and not attached): [1] bit64_0.9-7
> modelr_0.1.5 assertthat_0.2.1 [4] blob_1.2.0
> BSgenome_1.52.0 GenomeInfoDbData_1.2.1 [7]
> cellranger_1.1.0 progress_1.2.2 pillar_1.4.2
> [10] RSQLite_2.1.2 backports_1.1.4 lattice_0.20-38
> [13] glue_1.3.1 digest_0.6.20 rvest_0.3.4
> [16] colorspace_1.4-1 Matrix_1.2-17 XML_3.98-1.20
> [19] pkgconfig_2.0.2 broom_0.5.2 biomaRt_2.40.1
> [22] haven_2.1.1 zlibbioc_1.30.0 scales_1.0.0
> [25] generics_0.0.2 withr_2.1.2
> GenomicFeatures_1.36.4 [28] lazyeval_0.2.2 cli_1.1.0
> magrittr_1.5 [31] crayon_1.3.4 readxl_1.3.1
> memoise_1.1.0 [34] nlme_3.1-141 xml2_1.2.2
> tools_3.6.1 [37] prettyunits_1.0.2 hms_0.5.1
> lifecycle_0.1.0 [40] munsell_0.5.0
> AnnotationDbi_1.46.0 compiler_3.6.1 [43] rlang_0.4.0
> grid_3.6.1 RCurl_1.95-4.12 [46] rstudioapi_0.10
> bitops_1.0-6 gtable_0.3.0 [49] DBI_1.0.0
> R6_2.4.0 GenomicAlignments_1.20.1 [52] lubridate_1.7.4
> rtracklayer_1.44.0 bit_1.1-14 [55] zeallot_0.1.0
> rprojroot_1.3-2 stringi_1.4.3 [58] Rcpp_1.0.2
> vctrs_0.2.0 tidyselect_0.2.5
Can you share enough of the VCF file so that I can reproduce this? It might also help to name your filters, so that the error report indicates which filter the error occurs in, e.g.,
Alternatively, you could try to evaluate
traceback()
as soon as the error occurs. This will provide some indication about where the error occurs. You might also try a sequence likeI think you'll see something intimidating like
This is presenting the call stack indicating the sequence of function calls that eventually lead to the error. In the example above you could try different 'frames' and look around, e.g.,
Where at the
Browse[1]>
prompt you are 'in' the function at that place in the call stack and you can type almost all R commands, as well as some special commands likec
(forcontinue
) to continue evaluation.