Hello-
It's me again. I am trying to read in the kaviar database (vcf format) using VariantAnnotation()'s readVcfAsVRanges. It reads in the file, but gives me an error and does not include any metadata information.
Here is what the VCF looks like:
##fileformat=VCFv4.1 ##fileDate=20150211 ##source=bin/makeVCF.pl ##reference=Kaviar-141127 (hg19) ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> ##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele Count"> ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in data sources"> ##INFO=<ID=END,Number=.,Type=Integer,Description="End position"> ##INFO=<ID=DS,Number=A,Type=String,Description="Data Sources containing allele"> ##contig=<ID=1,assembly="hg19"> #CHROM POS ID REF ALT QUAL FILTER INFO 1 39998059 rs79139830 A G . . AF=0.0003573;AC=6;AN=16794;DS=phase3-ASW|phase3-LWK|phase3-PUR|phase3-YRI 1 39998084 rs115898201 C A . . AF=0.0014291;AC=24;AN=16794;DS=phase3-ACB|phase3-ASW|phase3-ESN|phase3-GWD|phase3-MSL|phase3-YRI 1 39998085 rs189533208 G A . . AF=5.95e-05;AC=1;AN=16794;DS=phase3-MXL 1 39998121 1:39998121_G/A G A . . AF=5.95e-05;AC=1;AN=16794;DS=GS000012713 1 39998137 rs555254194 T C . . AF=5.95e-05;AC=1;AN=16794;DS=phase3-YRI 1 39998147 1:39998147_T/TA T TA . . AF=5.95e-05;AC=1;AN=16794;DS=GS000015708 1 39998158 1:39998158_C/CA C CA . . AF=5.95e-05;AC=1;AN=16794;DS=GS000009931 1 39998160 rs182185201 A C . . AF=0.0007741;AC=13;AN=16794;DS=Malay|UK10K|phase3-CDX|phase3-CHS|phase3-KHV|phase3-STU 1 39998190 1:39998190_G/A G A . . AF=5.95e-05;AC=1;AN=16794;DS=UK10K
Here is my command, and the outcome:
> ##read in edited kaviar vcf > kaviar <- readVcfAsVRanges(kaviar, humie_ref) Warning message: In .vcf_usertag(map, tag, ...) : ScanVcfParam ‘geno’ fields not present: ‘AD’ > kaviar VRanges object with 242 ranges and 0 metadata columns: seqnames ranges strand ref alt totalDepth refDepth altDepth sampleNames softFilterMatrix <Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle> <integerOrRle> <integerOrRle> <factorOrRle> <matrix> [1] 1 [39998059, 39998059] + A G <NA> <NA> <NA> <NA> [2] 1 [39998084, 39998084] + C A <NA> <NA> <NA> <NA> [3] 1 [39998085, 39998085] + G A <NA> <NA> <NA> <NA> [4] 1 [39998121, 39998121] + G A <NA> <NA> <NA> <NA> [5] 1 [39998137, 39998137] + T C <NA> <NA> <NA> <NA> ... ... ... ... ... ... ... ... ... ... ... [238] 1 [39999275, 39999275] + T A <NA> <NA> <NA> <NA> [239] 1 [39999288, 39999288] + A T <NA> <NA> <NA> <NA> [240] 1 [39999318, 39999319] + AG A <NA> <NA> <NA> <NA> [241] 1 [39999319, 39999320] + GG G <NA> <NA> <NA> <NA> [242] 1 [39999330, 39999330] + A G <NA> <NA> <NA> <NA> ------- seqinfo: 1 sequence from hg19 genome; no seqlengths hardFilters: NULL
Thanks in advance for your help!
> sessionInfo() R version 3.1.2 (2014-10-31) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252 attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] stringi_0.3-1 BiocInstaller_1.16.1 VariantAnnotation_1.12.9 Rsamtools_1.18.2 Biostrings_2.34.1 XVector_0.6.0 [7] GenomicRanges_1.18.4 GenomeInfoDb_1.2.4 IRanges_2.0.1 S4Vectors_0.4.0 BiocGenerics_0.12.1 loaded via a namespace (and not attached): [1] AnnotationDbi_1.28.1 base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.9 Biobase_2.26.0 BiocParallel_1.0.3 [7] biomaRt_2.22.0 bitops_1.0-6 brew_1.0-6 BSgenome_1.34.1 checkmate_1.5.1 codetools_0.2-10 [13] DBI_0.3.1 digest_0.6.8 fail_1.2 foreach_1.4.2 GenomicAlignments_1.2.1 GenomicFeatures_1.18.3 [19] iterators_1.0.7 packrat_0.4.3 RCurl_1.95-4.5 RSQLite_1.0.0 rtracklayer_1.26.2 sendmailR_1.2-1 [25] stringr_0.6.2 tools_3.1.2 XML_3.98-1.1 zlibbioc_1.12.0
Hi,
Great. It looks like you have everything sorted out. Just a couple of FYIs,
- VRangesScanVcfParam() is deprecated so it's best to use ScanVcfParam()
- In a previous post you were using 'metadata' as an argument to ScanVcfParam which is not a valid. Only 'info', 'geno', 'samples' and 'fixed' are valid arguments. The man page ?ScanVcfParam describes the valid fields. Above you are using 'info' in VRangesScanVcfParam() so you've probably figured that out, I just wanted to confirm.
- 'AD' (allelic depth) is a genotype or FORMAT field. The warning you saw with readVcfAsVRanges() was a warning, not an error. The VRanges class is designed to hold variants and by default readVcfAsVRanges() tries to read in the AD field. It's ok if the file doesn't have the field.
- If you haven't tried it yet, scanVcfHeader(file) is useful to get a listing of all fields in the file. If you do
geno(scanVcfHeader(Kav_ranged.vcf.gz))
you'll see a DataFrame of geno information and 'AD' will not be listed.
Valerie