When running readVCF with 1000 genomes data from 20130502 no genotype is returned.
I believe this to be because readVCF looks for a description of the FORMAT field in the header of the vcf file, but in the latest version of 1000 genomes, this is not present. It seems based upon my reading of the specification listed here
http://samtools.github.io/hts-specs/VCFv4.1.pdf
that the FORMAT field does not have to be specified in the header if it may change from variant to variant.
The vcf files however do posses the entry GT under the field FORMAT for each variant that has a genotype call. They are actually simplified, because as far as I can tell they only specify GT and no other field.
here is an example that ought to be reproducible:
region:
loc <- GRanges(1, IRanges(232164611, 233164611)) params <- ScanVcfParam(which=loc)
old 1000 genomes data:
file <- "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz" vcf <- readVcf(TabixFile(file), "hg19", params) geno(vcf)
new 1000 genomes data:
file <- "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz" vcf <- readVcf(TabixFile(file), "hg19", params) geno(vcf)
One may confirm that the new 1000 genomes data does in fact contain GT information with tabix via:
tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 1:232164611-233164611 > temp.vcf