Data from ##SAMPLE lines in VCF?
2
0
Entering edit mode
lvclark ▴ 10
@lvclark-17516
Last seen 4.3 years ago

I'm experimenting with custom VCF formats.  More specifically, some colleagues and I are working on establishing standards for polyploid data in non-model organisms.

I'd like to make use of the Sample field format as described in the current VCF specification part 1.4.8.  It especially seems useful for long-term data preservation, enabling us to put all sample metadata into the file.

I made the following dummy file to see if I could import data from ##SAMPLE lines.

##fileformat=VCFv4.3
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##META=<ID=Species,Type=String,Number=.,Description="Species name">
##META=<ID=Ploidy,Type=String,Number=.,Description="Ploidy with respect to reference genome">
##SAMPLE=<ID=Sam1,Species="Miscanthus sinensis",Ploidy="2x",Description="Sample1">
##SAMPLE=<ID=Sam2,Species="Miscanthus sacchariflorus",Ploidy="4x",Description="Sample2">
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    Sam1    Sam2
Chr01    1000    S01_1000    G    A    100    PASS    DP=20    GT:AD    0/0:10,0    0/1:5,5

readVcf works, but I can't find the sample metadata anywhere in the vcf object.  It seems like colData would be the appropriate place to put it.  If I manually add columns to colData, writeVcf gives an error.

myvcf <- readVcf("testvcf.vcf")
colData(myvcf)
colData(myvcf)$Ploidy <- c("2x", "4x")
writeVcf(myvcf, file = "outvcf.vcf")

Is there something I'm missing?  Or if not, what is the possibility of this being implemented in VariantAnnotation at some point in the future?

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] VariantAnnotation_1.26.1    Rsamtools_1.32.3            Biostrings_2.48.0          
 [4] XVector_0.20.0              SummarizedExperiment_1.10.1 DelayedArray_0.6.6         
 [7] BiocParallel_1.14.2         matrixStats_0.54.0          Biobase_2.40.0             
[10] GenomicRanges_1.32.7        GenomeInfoDb_1.16.0         IRanges_2.14.12            
[13] S4Vectors_0.18.3            BiocGenerics_0.26.0        

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18             compiler_3.5.0           prettyunits_1.0.2        progress_1.2.0          
 [5] GenomicFeatures_1.32.2   bitops_1.0-6             tools_3.5.0              zlibbioc_1.26.0         
 [9] biomaRt_2.36.1           digest_0.6.17            bit_1.1-14               BSgenome_1.48.0         
[13] evaluate_0.11            RSQLite_2.1.1            memoise_1.1.0            lattice_0.20-35         
[17] pkgconfig_2.0.2          rlang_0.2.2              Matrix_1.2-14            rstudioapi_0.7          
[21] DBI_1.0.0                yaml_2.2.0               GenomeInfoDbData_1.1.0   rtracklayer_1.40.6      
[25] httr_1.3.1               stringr_1.3.1            knitr_1.20               hms_0.4.2               
[29] rprojroot_1.3-2          bit64_0.9-7              grid_3.5.0               R6_2.2.2                
[33] AnnotationDbi_1.42.1     XML_3.98-1.16            rmarkdown_1.10           blob_1.1.1              
[37] magrittr_1.5             GenomicAlignments_1.16.0 backports_1.1.2          htmltools_0.3.6         
[41] assertthat_0.2.0         stringi_1.1.7            RCurl_1.95-4.11          crayon_1.3.4 
variantannotation coldata • 3.7k views
ADD COMMENT
1
Entering edit mode
lvclark ▴ 10
@lvclark-17516
Last seen 4.3 years ago

Found it:

meta(header(myvcf))$SAMPLE
ADD COMMENT
0
Entering edit mode

Although, there is still an issue with writeVcf, and looking at the source code it seems to be that now I have another data frame called META which creates a conflict with the META data frame containing the VCF version number.

ADD REPLY
0
Entering edit mode

I can confirm this.  None of our example VCF in VariantAnnotation use the META header record type

(https://samtools.github.io/hts-specs/VCFv4.3.pdf section 1.4.8).  Additionally writeVcf-methods.Rd mentions Bioconductor 3.1 and the contemporaneous standard v4.2 of VCF spec.  It looks like there is some updating to be done in VariantAnnotation for writeVcf to work with your example.  NB -- one can't cut and paste your example VCF because tabs are not preserved.  This threw me for a bit. 

ADD REPLY
0
Entering edit mode

Whoops, sorry about the tabs going away!  And thank you for looking into this.

ADD REPLY
0
Entering edit mode

Thanks for the report. The 'META' name clash is unfortunate. I'll work on a solution and should have something in they next couple of days.

Valerie

ADD REPLY
0
Entering edit mode
@valerie-obenchain-4275
Last seen 3.0 years ago
United States

VariantAnnotation 1.27.6 and Rsamtools 1.33.6 have been updated to support VCFv4.3.

I wanted to add some background to explain why we had the "META" DataFame in the first place. The meta-information lines in a VCF header are preceded by '##' and we can think of these key-value pairs as "simple" or "non-simple".

A simple pair would have a single value, e.g.,

##fileformat=VCFv4.3

A non-simple pair has values enclosed in '<>' with subsections such as 'ID', 'Number', 'Type' etc., e.g.,

##SAMPLE=<ID=Sample1,Assay=WholeGenome,Ethnicity=AFR,Disease=None,Description="Patient germ>

Previously the simple pairs went into a DataFrame called "META" and the non-simple went into separate DataFrames, one per unique key. Now that meta-information lines can have a key called "META" we can end up with 2 DataFrames with the same name:

> hdr
class: VCFHeader
samples(0):
meta(4): META Tassel META SAMPLE
fixed(0):
info(3): NS DP AF
geno(5): GT AD DP GQ PL

Naming the DataFrame "META" was (my) poor choice from the start since it doesn't represent all meta-information lines, just the simple ones. We could re-name this but we'll always run the risk that a new header line will have the same key name and we'll be back in the same situation.

I've decided to split the "META" DataFrame, making each row a separate DataFrame. This seems like the cleanest solution in that `meta()` now returns a DataFrame for each unique key in the header.

> hdr
class: VCFHeader
samples(0):
meta(4): fileformat Tassel META SAMPLE
fixed(0):
info(3): NS DP AF
geno(5): GT AD DP GQ PL

Unfortunately this is not a completely backwards compatible solution. The package functions support both v4.2 or v4.3 but if users have code that extracts 'fileformat', 'fileDate' etc. from the old "META" DataFrame they will need to update their code with VariantAnnotation >= 1.27.6.

Because next release is less than 4 weeks away, I'm inclined to not port this to the current release (BioC 3.7) unless there is a strong need.

ADD COMMENT
0
Entering edit mode

I'm perfectly happy waiting a month for this (or installing the development version in the meantime).  The project that I'm working on in which I encountered this issue is very much in the early stages.

Thank you so much for fixing this so quickly!

ADD REPLY

Login before adding your answer.

Traffic: 390 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6