Question

ReadVCF not reading samples

0

Entering edit mode

askates ▴ 10

@askates-7793

Last seen 9.8 years ago

United Kingdom

Hi, I'm trying to read a VCF file I have using readVcf, however the following occurs:

> fl <- file.path('tmp', 'vcf', 'rs11780156.genotypes.vcf.gz')
> vcf <- readVcf(fl, 'hg19')
> scanVcfHeader(fl)
class: VCFHeader
samples(190): HG00096 HG00097 ... NA12889 NA12890
meta(2): META contig
fixed(2): FILTER ALT
info(27): CIEND CIPOS ... EX_TARGET MULTI_ALLELIC
geno(0):
> dim(vcf)
[1] 13161     0
> geno(vcf)
List of length 0
names(0):

I've looked at the file itself, and it definitely has the sample genotypes in it, so I'm not sure why geno(vcf) is empty.

The file in question is at https://www.dropbox.com/s/qwa0xoaksnnye4s/rs11780156.genotypes.vcf.gz?dl=0 (fixed link) however it seems to apply to any that I try and load via readVcf. These are just VCF I files I downloaded from 1000 Genomes using tabix.

readvcf vcf variantannotation • 3.2k views

ADD COMMENT • link 9.8 years ago askates ▴ 10

score 0 · Answer 1 · 2015-05-22

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 3.1 years ago

United States

Hi,

I get an error when I try to access the dropbox link ...?

Also, please show the output of your sessionInfo(), e.g.,

> sessionInfo()
R version 3.2.0 Patched (2015-05-05 r68332)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Fedora 21 (Twenty One)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] VariantAnnotation_1.15.9   Rsamtools_1.21.5          
 [3] Biostrings_2.37.2          XVector_0.9.1             
 [5] SummarizedExperiment_0.1.5 Biobase_2.29.1            
 [7] GenomicRanges_1.21.12      GenomeInfoDb_1.5.3        
 [9] IRanges_2.3.8              S4Vectors_0.7.2           
[11] BiocGenerics_0.15.0       

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.31.9    GenomicAlignments_1.5.9 zlibbioc_1.15.0        
 [4] BiocParallel_1.3.12     BSgenome_1.37.1         tools_3.2.0            
 [7] DBI_0.3.1               lambda.r_1.1.7          futile.logger_1.4.1    
[10] rtracklayer_1.29.6      futile.options_1.0.0    bitops_1.0-6           
[13] RCurl_1.95-4.6          biomaRt_2.25.1          RSQLite_1.0.0          
[16] GenomicFeatures_1.21.5  XML_3.98-1.1

Valerie

ADD COMMENT • link 9.8 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Sorry, not sure what happened with the link. https://www.dropbox.com/s/qwa0xoaksnnye4s/rs11780156.genotypes.vcf.gz?dl=0 should hopefully work!

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices
[6] utils     datasets  methods   base     

other attached packages:
 [1] snpStats_1.18.0          Matrix_1.2-0            
 [3] survival_2.38-1          VariantAnnotation_1.14.1
 [5] Rsamtools_1.20.2         Biostrings_2.36.1       
 [7] XVector_0.8.0            GenomicRanges_1.20.3    
 [9] GenomeInfoDb_1.4.0       IRanges_2.2.1           
[11] S4Vectors_0.6.0          BiocGenerics_0.14.0     

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.30.1    splines_3.2.0          
 [3] zlibbioc_1.14.0         GenomicAlignments_1.4.1
 [5] BiocParallel_1.2.1      lattice_0.20-31        
 [7] BSgenome_1.36.0         tools_3.2.0            
 [9] grid_3.2.0              Biobase_2.28.0         
[11] DBI_0.3.1               lambda.r_1.1.7         
[13] futile.logger_1.4.1     rtracklayer_1.28.2     
[15] futile.options_1.0.0    bitops_1.0-6           
[17] RCurl_1.95-4.6          biomaRt_2.24.0         
[19] RSQLite_1.0.0           GenomicFeatures_1.20.1 
[21] XML_3.98-1.1

I've been doing a bit more searching online for using readVcf with 1000 Genome data, and it appears to be specific to the files I'm using... am I using the wrong files perhaps? I tried with an example vcf posted on this site and it worked:

> fl1 <- "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supporting/AFR.2of4intersection_allele_freq.20100804.genotypes.vcf.gz"
> hdr1 <- scanVcfHeader(fl1)
> hdr1
class: VCFHeader 
samples(174): HG00553 HG00554 ... NA19982
  NA20414
meta(1): META
fixed(0):
info(6): AF DP ... AFR_R2 ASN_R2
geno(7): GT AD ... GD OG
> 
> fl2 <- "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
> hdr2 <- scanVcfHeader(fl2)
> hdr2
class: VCFHeader 
samples(2504): HG00096 HG00097 ... NA21143
  NA21144
meta(2): META contig
fixed(2): FILTER ALT
info(27): CIEND CIPOS ... EX_TARGET
  MULTI_ALLELIC
geno(0):

ADD REPLY • link 9.8 years ago askates ▴ 10

1

Entering edit mode

That link worked. Thanks.

The problem is that the file is missing FORMAT header lines. Add this to your header and GT reads in fine:

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

readVcf() uses the header information to parse data into correct types. Also, they are expected as part of the VCF 4.2 spec (section 1.2.4):

http://samtools.github.io/hts-specs/VCFv4.2.pdf

Valerie

ADD REPLY • link 9.8 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thanks Valerie, you're a life saver! Is there any way to parse the data without explicitly altering the physical files, for example by loading it into a TabixFile and altering it there? Otherwise it would mean downloading all 1000 Genomes chromosomes to disk, in total about ~16 GB, and I'd like to avoid doing this if possible.

ADD REPLY • link 9.8 years ago askates ▴ 10

0

Entering edit mode

The original 1000 genomes files should have the FORMAT lines. If not, that should be reported to them. It sounds like you downloaded subsets with tabix on the command line. Did something go wrong in that step? What command did you use?

A Bioconductor TabixFile object is a reference to a file on disk so you can't modify data through that. You could add the FORMAT lines using an editor such as vim, emacs etc. but really we should understand why they are missing.

Valerie

ADD REPLY • link 9.8 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

I did download subsets through tabix, and it seemed to work -- the fact that I get the issue when I reference the file directly from their ftp leads me to believe that the issue is on their end though. See above:

> fl2 <- "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
> hdr2 <- scanVcfHeader(fl2)
> hdr2
class: VCFHeader 
samples(2504): HG00096 HG00097 ... NA21143
  NA21144
meta(2): META contig
fixed(2): FILTER ALT
info(27): CIEND CIPOS ... EX_TARGET
  MULTI_ALLELIC
geno(0):

ADD REPLY • link 9.8 years ago askates ▴ 10

0

Entering edit mode

Yes, it does look like 1000 Genomes is generating these without FORMAT header lines.

readVcf() can be modified to read all FORMAT fields (regardless of header) and use the default data types in section 1.4.2 of the spec. If a FORMAT header line is provided it will override the default.

http://samtools.github.io/hts-specs/VCFv4.2.pdf

I'll put this on the TODO and post back here when it's done.

Valerie

Update on 8/13/15:

Attempting to parse the fields in 1.4.2 without header lines can be costly. Space must be allocated for each field, many of which may not be present in any of the variants. Additionally, many fields vary in length and data type which is specific to the file data; without this information the field can only be returned as an unparsed character vector which, in most cases, is not useful.

As an alternative, 'verbose' was added to readVcf(). Using 'verbose=TRUE' prints the fields found in the header and should help explain less data than expected info(vcf) or geno(vcf). I've also added a few man page examples.

readVcf(file, "genome", verbose=TRUE)

The 'verbose' option is available in devel (VariantAnnotation >= 1.15.22).

Valerie

ADD REPLY • link 9.8 years ago • updated 9.5 years ago Valerie Obenchain ★ 6.8k

1

Entering edit mode

I got in touch with 1000 Genomes, and they have now resolved the issue on their end. Thanks for all your help, Valerie.