VariantAnnotation VCF read error
JK ▴ 10
@jk-6972

Hi,

Running

(vcf <- readVcf(vcffile, "hg38"))

in VariantAnnotation gives me this error message:

Error in DataFrame(Samples = seq_along(colnms), row.names = colnms) : duplicate row names

What might be causing this? I am not sure how I ended up with duplicate names in the VCF file. The file was generated by merging several files with the vcftools vcf-merge tool. Could that be the problem?

Thank you!

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] VariantAnnotation_1.12.2 Rsamtools_1.18.1         Biostrings_2.34.0        XVector_0.6.0            GenomicRanges_1.18.1     GenomeInfoDb_1.2.2      
 [7] IRanges_2.0.0            S4Vectors_0.4.0          GWASTools_1.12.0         gdsfmt_1.0.4             ncdf_1.6.8               Biobase_2.26.0          
[13] BiocGenerics_0.12.0     

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.28.1    base64enc_0.1-2         BatchJobs_1.4           BBmisc_1.7              BiocParallel_1.0.0      biomaRt_2.22.0         
 [7] bitops_1.0-6            brew_1.0-6              BSgenome_1.34.0         checkmate_1.4           codetools_0.2-9         DBI_0.3.1              
[13] digest_0.6.4            DNAcopy_1.40.0          fail_1.2                foreach_1.4.2           GenomicAlignments_1.2.0 GenomicFeatures_1.18.2 
[19] grid_3.1.1              GWASExactHW_1.01        iterators_1.0.7         lattice_0.20-29         lmtest_0.9-33           quantreg_5.05          
[25] quantsmooth_1.32.0      RCurl_1.95-4.3          RSQLite_0.11.4          rtracklayer_1.26.1      sandwich_2.3-2          sendmailR_1.2-1        
[31] SparseM_1.05            splines_3.1.1           stringr_0.6.2           survival_2.37-7         tools_3.1.1             XML_3.98-1.1           
[37] zlibbioc_1.12.0         zoo_1.7-11             
@valerie-obenchain-4275

Hi Jozsef,

The duplicate names are likely in the header FORMAT fields. If the file isn't too big, open it in an editor and look at the header lines that start with FORMAT. According to the VCF spec, the 'ID' key must be unique within a field type: no two INFO lines should share an 'ID', and the same applies to FORMAT lines.

You can try reading in just the header information, but if you have duplicate fields you may get an error:

> fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
> hdr <- scanVcfHeader(fl)

Another approach is to scan in the data then look at the names of the FORMAT (i.e., 'geno') fields:

> scn <- scanVcf(fl)
> names(scn[[1]]$GENO)
[1] "GT" "GQ" "DP" "HQ"
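If the file is too large to open in an editor, the same check can be done in base R without parsing the whole file. This is only a sketch: the header lines below are made up for illustration, and `duplicated_ids()` is a hypothetical helper, not part of VariantAnnotation. For a real file you would first pull the lines starting with "##" out of `readLines()`.

```r
# Made-up header lines for illustration. For a real file, something like
#   hdr_lines <- grep("^##", readLines(vcffile), value = TRUE)
# would collect them (this reads the whole file into memory).
hdr_lines <- c(
  '##fileformat=VCFv4.1',
  '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
  '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
  '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">',
  '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">'
)

# Hypothetical helper: return the ID keys that occur more than once
# among header lines of one kind ("INFO", "FORMAT", ...).
duplicated_ids <- function(lines, kind) {
  pattern <- paste0("^##", kind, "=<ID=([^,>]+).*$")
  hits <- grep(pattern, lines, value = TRUE)
  ids <- sub(pattern, "\\1", hits)
  unique(ids[duplicated(ids)])
}

format_dups <- duplicated_ids(hdr_lines, "FORMAT")
info_dups <- duplicated_ids(hdr_lines, "INFO")

print(format_dups)  # IDs repeated among the FORMAT lines
print(info_dups)    # IDs repeated among the INFO lines
```

An empty result for both means the 'ID' keys are unique within each field type, so the duplicates have to come from somewhere else in the file.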

 


Valerie 


Valerie,

Thank you.

This is what I have:

> names(scn[[1]]$GENO)
[1] "GT" "AD" "DP" "GQ" "PL"
> hdr
class: VCFHeader 
samples(18): sample AGP002_output_filtered_sample ... AGP046_output_filtered_sample AGP061_output_filtered_sample
meta(2): fileformat reference
fixed(1): FILTER
info(18): AF BaseQRankSum ... AC AN
geno(5): GT AD DP GQ PL

Jozsef


Can you send the file (or a small portion of it) to me off-line? (vobencha@fhcrc.org)

Valerie


Thanks for sending the file. (Testing done with VariantAnnotation 1.13.5 in devel.)

The duplicates were in the sample names, not the FORMAT field; sorry for steering you wrong there. You can view the samples by calling samples() on the header object:

> hdr <- scanVcfHeader(fl)
> samples(hdr)
 [1] "sample"                        "AGP002_output_filtered_sample"
 [3] "AGP003_output_filtered_sample" "AGP004_output_filtered_sample"
 [5] "AGP004_output_filtered_sample" "AGP007_output_filtered_sample"
 [7] "AGP009_output_filtered_sample" "AGP012_output_filtered_sample"
 [9] "AGP013_output_filtered_sample" "AGP022_output_filtered_sample"
[11] "AGP025_output_filtered_sample" "AGP027_output_filtered_sample"
[13] "AGP029_output_filtered_sample" "AGP040_output_filtered_sample"
[15] "AGP041_output_filtered_sample" "AGP044_output_filtered_sample"
[17] "AGP046_output_filtered_sample" "AGP061_output_filtered_sample"


The duplicate entry is 'AGP004_output_filtered_sample', which appears twice (positions 4 and 5). Also, it looks like the first name, 'sample', is missing a prefix. Maybe it should have 'AGP001_output_filtered' in front?

You can more easily see the duplicates with a self-match (4th element matches both the 4th and 5th names):

> match(samples(hdr), samples(hdr))
 [1]  1  2  3  4  4  6  7  8  9 10 11 12 13 14 15 16 17 18
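With more samples, it may be easier to list the offending names directly than to scan the match() output by eye. A minimal sketch in base R, using a shortened, hand-typed vector in place of samples(hdr); the variable names `smp`, `dup_names` and `dup_pos` are only for illustration:

```r
# Shortened stand-in for samples(hdr); with a real file you would use
#   smp <- samples(hdr)
smp <- c("sample",
         "AGP002_output_filtered_sample",
         "AGP003_output_filtered_sample",
         "AGP004_output_filtered_sample",
         "AGP004_output_filtered_sample",
         "AGP007_output_filtered_sample")

# Names that occur more than once
dup_names <- unique(smp[duplicated(smp)])

# Every position holding one of those names (first occurrence included)
dup_pos <- which(smp %in% dup_names)

print(dup_names)
print(dup_pos)

# anyDuplicated() returns 0 when all names are unique, so it works as a
# quick yes/no check before calling readVcf()
print(anyDuplicated(smp) > 0)
```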

There are also many extra tabs at the end of most header lines. You can see these by inspecting the header with meta(). Because VCF files are tab-delimited, these tabs are not ignored as plain white space; they end up as part of the parsed field names, so it would be good to remove them.

> names(meta(hdr))
[1] "META"                                                          
[2] "FILTER\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t"         
[3] "FORMAT\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t"         
...        
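One way to strip the trailing tabs is with a regular expression over the meta-information lines. This is a sketch under assumptions: the lines below are invented for illustration, the commented-out file names are placeholders, and reading the whole file with readLines() only makes sense if it fits in memory.

```r
# Invented lines for illustration. For a real file:
#   vcf_lines <- readLines("merged.vcf")   # placeholder file name
vcf_lines <- c(
  "##fileformat=VCFv4.1\t\t\t",
  "##FILTER=<ID=LowQual,Description=\"Low quality\">\t\t",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
  "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1"
)

# Only touch meta-information lines (those starting with "##").
# Data lines are left alone, since tabs there separate real fields.
is_meta <- grepl("^##", vcf_lines)

cleaned <- vcf_lines
cleaned[is_meta] <- sub("\t+$", "", vcf_lines[is_meta])

print(cleaned[is_meta])

# Write the result to a new file rather than overwriting the original:
#   writeLines(cleaned, "merged_clean.vcf")   # placeholder file name
```

If the "#CHROM" line also carries trailing tabs, it would need the same treatment; it is left out here because a trailing tab on that line could also indicate an empty sample name, which is worth checking by hand.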

(FYI, the release version of VariantAnnotation returns a DataFrame for meta(hdr) instead of the DataFrameList you'll see in devel. It is just a different packaging of the same information.)

Let me know if you still have problems after cleaning up the extra tabs and fixing the sample names.

Valerie


Thanks. That was an error in the script used to generate this file.
