I've been doing a bit more searching online for using readVcf with 1000 Genome data, and it appears to be specific to the files I'm using... am I using the wrong files perhaps? I tried with an example vcf posted on this site and it worked:
Thanks Valerie, you're a life saver! Is there any way to parse the data without explicitly altering the physical files, for example by loading it into a TabixFile and altering it there? Otherwise it would mean downloading all 1000 Genomes chromosomes to disk, in total about ~16 GB, and I'd like to avoid doing this if possible.
The original 1000 genomes files should have the FORMAT lines. If not, that should be reported to them. It sounds like you downloaded subsets with tabix on the command line. Did something go wrong in that step? What command did you use?
A Bioconductor TabixFile object is a reference to a file on disk so you can't modify data through that. You could add the FORMAT lines using an editor such as vim, emacs etc. but really we should understand why they are missing.
I did download subsets through tabix, and it seemed to work -- the fact that I get the issue when I reference the file directly from their ftp leads me to believe that the issue is on their end though. See above:
Yes, it does look like 1000 Genomes is generating these without FORMAT header lines.
readVcf() can be modified to read all FORMAT fields (regardless of header) and use the default data types in section 1.4.2 of the spec. If a FORMAT header line is provided it will override the default.
I'll put this on the TODO and post back here when it's done.
Valerie
Update on 8/13/15:
Attempting to parse the fields in 1.4.2 without header lines can be costly. Space must be allocated for each field, many of which may not be present in any of the variants. Additionally, many fields vary in length and data type which is specific to the file data; without this information the field can only be returned as an unparsed character vector which, in most cases, is not useful.
As an alternative, 'verbose' was added to readVcf(). Using 'verbose=TRUE' prints the fields found in the header and should help explain less data than expected info(vcf) or geno(vcf). I've also added a few man page examples.
readVcf(file, "genome", verbose=TRUE)
The 'verbose' option is available in devel (VariantAnnotation >= 1.15.22).
Sorry, not sure what happened with the link. https://www.dropbox.com/s/qwa0xoaksnnye4s/rs11780156.genotypes.vcf.gz?dl=0 should hopefully work!
I've been doing a bit more searching online for using readVcf with 1000 Genome data, and it appears to be specific to the files I'm using... am I using the wrong files perhaps? I tried with an example vcf posted on this site and it worked:
That link worked. Thanks.
The problem is that the file is missing FORMAT header lines. Add this to your header and GT reads in fine:
readVcf()
uses the header information to parse data into correct types. Also, they are expected as part of the VCF 4.2 spec (section 1.2.4):http://samtools.github.io/hts-specs/VCFv4.2.pdf
Valerie
Thanks Valerie, you're a life saver! Is there any way to parse the data without explicitly altering the physical files, for example by loading it into a TabixFile and altering it there? Otherwise it would mean downloading all 1000 Genomes chromosomes to disk, in total about ~16 GB, and I'd like to avoid doing this if possible.
The original 1000 genomes files should have the FORMAT lines. If not, that should be reported to them. It sounds like you downloaded subsets with tabix on the command line. Did something go wrong in that step? What command did you use?
A Bioconductor TabixFile object is a reference to a file on disk so you can't modify data through that. You could add the FORMAT lines using an editor such as vim, emacs etc. but really we should understand why they are missing.
Valerie
I did download subsets through tabix, and it seemed to work -- the fact that I get the issue when I reference the file directly from their ftp leads me to believe that the issue is on their end though. See above:
Yes, it does look like 1000 Genomes is generating these without FORMAT header lines.
readVcf() can be modified to read all FORMAT fields (regardless of header) and use the default data types in section 1.4.2 of the spec. If a FORMAT header line is provided it will override the default.
http://samtools.github.io/hts-specs/VCFv4.2.pdf
I'll put this on the TODO and post back here when it's done.
Valerie
Update on 8/13/15:
Attempting to parse the fields in 1.4.2 without header lines can be costly. Space must be allocated for each field, many of which may not be present in any of the variants. Additionally, many fields vary in length and data type which is specific to the file data; without this information the field can only be returned as an unparsed character vector which, in most cases, is not useful.
As an alternative, 'verbose' was added to
readVcf()
. Using 'verbose=TRUE' prints the fields found in the header and should help explain less data than expected info(vcf) or geno(vcf). I've also added a few man page examples.The 'verbose' option is available in devel (VariantAnnotation >= 1.15.22).
Valerie
I got in touch with 1000 Genomes, and they have now resolved the issue on their end. Thanks for all your help, Valerie.
Great. Thanks for contacting them.