VCF requires unique row names, but the formula used to generate placeholder row names can produce duplicates (Such as for pindel with default parameters run on the NA12878 Illumina platinum genomic WGS data). Is this intentional, or just an oversight due to the lack of structural variant data out there?
Reproduction test case:
library(VariantAnnotation)
library(testthat)
#' loads a VCF containing the given records
#' @param record string vector of record to write
.testrecord <- function(record) {
filename=tempfile(fileext=".vcf")
write(paste0(c(
"##fileformat=VCFv4.2",
"##ALT=<ID=DEL,Description=\"Deletion\">",
"##INFO=<ID=SVLEN,Number=.,Type=Integer,Description=\"Difference in length between REF and ALT alleles\">",
"##contig=<ID=chr1,length=249250621>",
"#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT",
record), collapse="\n"),
file=filename)
vcf <- readVcf(.testfile(filename), "")
# readVcf holds a file handle open so we won't be able to delete the
# file even if we tried
#file.remove(filename)
return(vcf)
}
# the forum will likely mangle the TABs in the in-line VCF lines
test_that("Placeholder names are unique", {
expect_false(any(duplicated(row.names(.testrecord(c(
"chr1 100 . A <DEL> . . SVLEN=-1",
"chr1 100 . A <DEL> . . SVLEN=-2"))))))
})
sessionInfo()
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252 LC_NUMERIC=C LC_TIME=English_Australia.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] StructuralVariantAnnotation_0.1 VariantAnnotation_1.16.4 Rsamtools_1.22.0 Biostrings_2.38.4 XVector_0.10.0 SummarizedExperiment_1.0.2
[7] Biobase_2.30.0 GenomicRanges_1.22.4 GenomeInfoDb_1.6.3 IRanges_2.4.8 S4Vectors_0.8.11 BiocGenerics_0.16.1
[13] assertthat_0.1 roxygen2_5.0.1 testthat_0.11.0 devtools_1.10.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.4 futile.logger_1.4.1 GenomicFeatures_1.22.13 bitops_1.0-6 futile.options_1.0.0 tools_3.2.2
[7] zlibbioc_1.16.0 biomaRt_2.26.1 digest_0.6.9 memoise_1.0.0 RSQLite_1.0.0 BSgenome_1.38.0
[13] DBI_0.3.1 rtracklayer_1.30.4 withr_1.0.1 stringr_1.0.0 BSgenome.Hsapiens.UCSC.hg19_1.4.0 AnnotationDbi_1.32.3
[19] XML_3.98-1.4 BiocParallel_1.4.3 lambda.r_1.1.7 magrittr_1.5 GenomicAlignments_1.6.3 stringi_1.0-1
[25] RCurl_1.95-4.8 crayon_1.3.1
The issue is that VariantAnnotation automatically creates names for the VCF rows that were not assigned any name. If these names are are used to write a VCF file, the resultant VCF will not be valid, because uniqueness is required by the VCF specifications.
In my code, I'm using the VCF ID as a unique record identifier as it is guaranteed to be unique. I added code to generate unique IDs if the ID was NA/"."/"" then ran into this unexpected VariantAnnotation behaviour. As far as I can tell, VariantAnnotation doesn't have an API that will tell the user whether the ID for any given record was from the VCF file itself, or created by the VariantAnnotation package.
Is it possible to either a) not assign any name when a VCF record has a "." in the ID column, b) assign names that conform to the VCF ID field specifications, and/or c) expose the ID field of the original VCF?
Thanks
(a) No, unfortunately not. It's more of an all or nothing thing. You probably know readVcf() has a 'row.names' arg and it's TRUE by default. If the ID is missing a rowname is created from CHROM,POS,REF,ALT. This was an attempt to create unique rownames/identifiers but as you've found doesn't work with some structural variants (and possibly others). The behavior is documented under the 'rowRanges' section of ?readVcf:
rowRanges: The CHROM, POS, ID and REF fields are used to create a
‘GRanges’ object. Ranges are created using POS as the start
value and width of the reference allele (REF). The IDs become
the rownames. If they are missing (i.e., ‘.’) a string of
CHROM:POS_REF/ALT is used instead. The ‘genome’ argument is
stored in the seqinfo of the ‘GRanges’ and can be accessed
with ‘genome(<VCF>)’.
Setting 'row.names=FALSE' makes all rownames NULL whether the ID was missing or not. So, this is only useful if all the IDs are missing.
(b) and (c) When 'row.names=TRUE', the ID field in the original VCF becomes the rownames on rowRanges() so this information is retained and exposed.
Because (I think) you want to keep ID's that are not missing, one option is to just replace the constructed rownames with "." (or whatever). The rownames for missing ID fields will always have a colon, underscore and forward slash. As long as your valid IDs don't have this you could do something like
names(vcf)[grepl(":", names(vcf)), fixed=TRUE] <- "."
Valerie
Thanks very much for your informative response. Are you open to patch submission on the VariantAnnotation package? In this particular case, make.unique() on just the variants generated by VariantAnnotation would given uniqueness without breaking the existing naming convention. I've done this in my own code, but it would be nice to offer such a solution for inclusion in VariantAnnotation itself.
Sure, always open to patches. Please include a unit test and send to me at valerie.obenchain@roswellpark.org.
Valerie
https://samtools.github.io/hts-specs/VCFv4.2.pdf
1.4.1.3 ID - identifier: Semi-colon separated list of unique identifiers where available. If this is a dbSNP variant it is encouraged to use the rs number(s). No identifier should be present in more than one data record. If there is no identifier available, then the missing value should be used. (String, no white-space or semi-colons permitted)
>If you're getting an error that says VCF data must have unique rownames please let me know and show a small example.
Here is a small example showing the problem:
Round-tripping a valid VCF through VariantAnnotation results in an output VCF that not only is not identical to the input (not a problem since VCF does not define INFO column ordering or newlines), but does not conform to the VCF specifications since the IDs written are not unique.