I'm experimenting with custom VCF formats. More specifically, some colleagues and I are working on establishing standards for polyploid data in non-model organisms.
I'd like to make use of the Sample field format as described in the current VCF specification part 1.4.8. It especially seems useful for long-term data preservation, enabling us to put all sample metadata into the file.
I made the following dummy file to see if I could import data from ##SAMPLE lines.
##fileformat=VCFv4.3
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##META=<ID=Species,Type=String,Number=.,Description="Species name">
##META=<ID=Ploidy,Type=String,Number=.,Description="Ploidy with respect to reference genome">
##SAMPLE=<ID=Sam1,Species="Miscanthus sinensis",Ploidy="2x",Description="Sample1">
##SAMPLE=<ID=Sam2,Species="Miscanthus sacchariflorus",Ploidy="4x",Description="Sample2">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sam1 Sam2
Chr01 1000 S01_1000 G A 100 PASS DP=20 GT:AD 0/0:10,0 0/1:5,5
readVcf works, but I can't find the sample metadata anywhere in the vcf object. It seems like colData would be the appropriate place to put it. If I manually add columns to colData, writeVcf gives an error.
myvcf <- readVcf("testvcf.vcf") colData(myvcf) colData(myvcf)$Ploidy <- c("2x", "4x") writeVcf(myvcf, file = "outvcf.vcf")
Is there something I'm missing? Or if not, what is the possibility of this being implemented in VariantAnnotation at some point in the future?
> sessionInfo() R version 3.5.0 (2018-04-23) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200) Matrix products: default locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] VariantAnnotation_1.26.1 Rsamtools_1.32.3 Biostrings_2.48.0 [4] XVector_0.20.0 SummarizedExperiment_1.10.1 DelayedArray_0.6.6 [7] BiocParallel_1.14.2 matrixStats_0.54.0 Biobase_2.40.0 [10] GenomicRanges_1.32.7 GenomeInfoDb_1.16.0 IRanges_2.14.12 [13] S4Vectors_0.18.3 BiocGenerics_0.26.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.18 compiler_3.5.0 prettyunits_1.0.2 progress_1.2.0 [5] GenomicFeatures_1.32.2 bitops_1.0-6 tools_3.5.0 zlibbioc_1.26.0 [9] biomaRt_2.36.1 digest_0.6.17 bit_1.1-14 BSgenome_1.48.0 [13] evaluate_0.11 RSQLite_2.1.1 memoise_1.1.0 lattice_0.20-35 [17] pkgconfig_2.0.2 rlang_0.2.2 Matrix_1.2-14 rstudioapi_0.7 [21] DBI_1.0.0 yaml_2.2.0 GenomeInfoDbData_1.1.0 rtracklayer_1.40.6 [25] httr_1.3.1 stringr_1.3.1 knitr_1.20 hms_0.4.2 [29] rprojroot_1.3-2 bit64_0.9-7 grid_3.5.0 R6_2.2.2 [33] AnnotationDbi_1.42.1 XML_3.98-1.16 rmarkdown_1.10 blob_1.1.1 [37] magrittr_1.5 GenomicAlignments_1.16.0 backports_1.1.2 htmltools_0.3.6 [41] assertthat_0.2.0 stringi_1.1.7 RCurl_1.95-4.11 crayon_1.3.4
Although, there is still an issue with writeVcf, and looking at the source code it seems to be that now I have another data frame called META which creates a conflict with the META data frame containing the VCF version number.
I can confirm this. None of our example VCF in VariantAnnotation use the META header record type
(https://samtools.github.io/hts-specs/VCFv4.3.pdf section 1.4.8). Additionally writeVcf-methods.Rd mentions Bioconductor 3.1 and the contemporaneous standard v4.2 of VCF spec. It looks like there is some updating to be done in VariantAnnotation for writeVcf to work with your example. NB -- one can't cut and paste your example VCF because tabs are not preserved. This threw me for a bit.
Whoops, sorry about the tabs going away! And thank you for looking into this.
Thanks for the report. The 'META' name clash is unfortunate. I'll work on a solution and should have something in they next couple of days.
Valerie