When I run
(vcf <- readVcf(vcffile, "hg38"))
in VariantAnnotation, I get this error message:
Error in DataFrame(Samples = seq_along(colnms), row.names = colnms) : duplicate row names
What may be causing this? I am not sure how I ended up with duplicate
names in the VCF file. My VCF file was generated by merging several
files using the vcftools vcf-merge function. Could this be the problem?
Thank you!
>sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] VariantAnnotation_1.12.2 Rsamtools_1.18.1 Biostrings_2.34.0 XVector_0.6.0 GenomicRanges_1.18.1 GenomeInfoDb_1.2.2
[7] IRanges_2.0.0 S4Vectors_0.4.0 GWASTools_1.12.0 gdsfmt_1.0.4 ncdf_1.6.8 Biobase_2.26.0
[13] BiocGenerics_0.12.0
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.28.1 base64enc_0.1-2 BatchJobs_1.4 BBmisc_1.7 BiocParallel_1.0.0 biomaRt_2.22.0
[7] bitops_1.0-6 brew_1.0-6 BSgenome_1.34.0 checkmate_1.4 codetools_0.2-9 DBI_0.3.1
[13] digest_0.6.4 DNAcopy_1.40.0 fail_1.2 foreach_1.4.2 GenomicAlignments_1.2.0 GenomicFeatures_1.18.2
[19] grid_3.1.1 GWASExactHW_1.01 iterators_1.0.7 lattice_0.20-29 lmtest_0.9-33 quantreg_5.05
[25] quantsmooth_1.32.0 RCurl_1.95-4.3 RSQLite_0.11.4 rtracklayer_1.26.1 sandwich_2.3-2 sendmailR_1.2-1
[31] SparseM_1.05 splines_3.1.1 stringr_0.6.2 survival_2.37-7 tools_3.1.1 XML_3.98-1.1
[37] zlibbioc_1.12.0 zoo_1.7-11
Valerie,
Thank you.
This is what I have:
> names(scn[[1]]$GENO)
[1] "GT" "AD" "DP" "GQ" "PL"
> hdr
class: VCFHeader
samples(18): sample AGP002_output_filtered_sample ... AGP046_output_filtered_sample AGP061_output_filtered_sample
meta(2): fileformat reference
fixed(1): FILTER
info(18): AF BaseQRankSum ... AC AN
geno(5): GT AD DP GQ PL
Jozsef
Can you send the file (or a small portion of it) to me off-line? (vobencha@fhcrc.org)
Valerie
Thanks for sending the file. (Testing done with VariantAnnotation 1.13.5 in devel.)
The duplicates were in the sample names, not the FORMAT field; sorry for steering you wrong there. You can view the sample names by calling samples() on the header object, e.g. samples(hdr).
The duplicate entries are 'AGP004_output_filtered_sample'. Also, it looks like the first 'sample' is missing a prefix. Maybe it should have 'AGP001_output_filtered' in front?
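As a cross-check that does not depend on how the header is parsed, the sample names can also be read straight from the #CHROM line of the file. The following is only a minimal base-R sketch: vcf_sample_names is a hypothetical helper name, it assumes an uncompressed VCF with a FORMAT column (so sample names start at column 10), and the demo lines are made up for illustration.

```r
# Hypothetical helper: pull sample names from the #CHROM header line.
# Assumes 'lines' are the text lines of an uncompressed VCF that has a
# FORMAT column, so the sample names start at column 10.
vcf_sample_names <- function(lines) {
  chrom_line <- grep("^#CHROM", lines, value = TRUE)[1]
  fields <- strsplit(chrom_line, "\t", fixed = TRUE)[[1]]
  if (length(fields) < 10) character(0) else fields[10:length(fields)]
}

# Made-up header for illustration; with a real file the lines would
# come from readLines(vcffile).
demo <- c(
  "##fileformat=VCFv4.1",
  paste(c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
          "FORMAT", "sample", "AGP004_output_filtered_sample",
          "AGP004_output_filtered_sample"), collapse = "\t")
)
smp <- vcf_sample_names(demo)

# Names that occur more than once
counts <- table(smp)
counts[counts > 1]
```

If the names read this way already contain repeats, the problem is in the file itself rather than in how readVcf() parses it.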
You can see the duplicates more easily with a self-match such as match(samples(hdr), samples(hdr)); the 4th element matches both the 4th and 5th names.
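A small sketch of the self-match idea, using a short illustrative vector in place of samples(hdr). Only 'sample' and the AGP002/AGP004 names come from this thread; the AGP003 and AGP005 entries are placeholders added so that the duplicate lands in positions 4 and 5.

```r
# Illustrative vector standing in for samples(hdr); the AGP003 and AGP005
# names are placeholders, not taken from the actual file.
smp <- c("sample",
         "AGP002_output_filtered_sample",
         "AGP003_output_filtered_sample",
         "AGP004_output_filtered_sample",
         "AGP004_output_filtered_sample",
         "AGP005_output_filtered_sample")

# match(x, x) returns, for each element, the position of its first occurrence.
first_pos <- match(smp, smp)
first_pos

# A position whose first occurrence is earlier than itself is a repeat.
repeats <- which(first_pos != seq_along(smp))
smp[repeats]
```

In a vector without duplicates the self-match is simply 1, 2, 3, ...; any repeated value in the result points back at the earlier entry with the same name.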
There are also many extra tabs at the end of most header lines. You can see these by inspecting the header with meta(). Because VCF files are tab-delimited, it would be good to remove these (i.e., they aren't treated as just white space).
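One possible way to clean up the trailing tabs is sketched below in base R. This is an assumption about how one might do it, not the only option: strip_header_tabs is a hypothetical helper, it only touches lines beginning with '#', and the demo lines (including the reference value) are made up.

```r
# Hypothetical helper: drop trailing tabs from header lines only
# (lines starting with '#'); data lines are left untouched.
strip_header_tabs <- function(lines) {
  is_header <- grepl("^#", lines)
  lines[is_header] <- sub("\t+$", "", lines[is_header])
  lines
}

# Made-up lines for illustration.
demo <- c("##fileformat=VCFv4.1\t\t\t",
          "##reference=hg38\t",
          "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
          "chr1\t100\t.\tA\tG\t50\tPASS\t.\t")
cleaned <- strip_header_tabs(demo)

# With a real, uncompressed file one could do something like:
#   writeLines(strip_header_tabs(readLines(vcffile)), "cleaned.vcf")
# where "cleaned.vcf" is just an example output path.
```

Restricting the substitution to header lines is deliberate: in the data lines a tab separates fields, so removing tabs there could change the number of columns.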
(FYI, the release version of VariantAnnotation returns a DataFrame for meta(hdr) instead of the DataFrameList you'll see in devel. It is just a different packaging of the same information.)
Let me know if you still have problems after cleaning up the extra tabs and fixing the sample names.
Valerie
Thanks. That was an error in the script used to generate this file.