Question

info(header(vcf)) remembers fields not imported: bugfix?

2

Entering edit mode

kevin.rue ▴ 350

@kevinrue-6757

Last seen 11 months ago

University of Oxford

Dear maintainer,

I notice a slightly strange behaviour of readVcf when ScanVcfParam(info=).

It seems that the imported VCF object somehow retains memories of the original VCF header (in the file), while the data was effectively dropped.

After import, the show method seems fine (header and INFO state that only the desired field was imported).

However, looking directly at info(header(vcf)) shows that the original fields are still there, while the info(vcf) show that only the desired field was imported. The show method must hide the extra fields based on the content of info(vcf), I guess. I have dug as far as the .showVCFSubclass method, but that's where it goes a little too decipher to read for me.

Many thanks,

Kevin

Let us take the example (from the vignette, plus a ScanVcfParam statement):

> svp <- ScanVcfParam(info = "NS")
> fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
> vcf2 <- readVcf(fl, "hg19", param = svp)

# Object seems fine (see info(vcf) and info(header(vcf))
> vcf2
class: CollapsedVCF
dim: 5 3
rowRanges(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 1 column: NS
info(header(vcf)):
      Number Type    Description               
   NS 1      Integer Number of Samples With Data
geno(vcf):
  SimpleList of length 4: GT, GQ, DP, HQ
geno(header(vcf)):
      Number Type    Description     
   GT 1      String  Genotype        
   GQ 1      Integer Genotype Quality
   DP 1      Integer Read Depth      
   HQ 2      Integer Haplotype Quality

# Headers of non-imported fields are still present: needs fix (?)
> info(header(vcf2))
DataFrame with 6 rows and 3 columns
        Number        Type                 Description
   <character> <character>                 <character>
NS           1     Integer Number of Samples With Data
DP           1     Integer                 Total Depth
AF           A       Float            Allele Frequency
AA           1      String            Ancestral Allele
DB           0        Flag dbSNP membership, build 129
H2           0        Flag          HapMap2 membership

# Only the desired data was imported: good
> info(vcf2)
DataFrame with 5 rows and 1 column
                      NS
               <integer>
rs6054257              3
20:17330_T/A           3
rs6040355              2
20:1230237_T/.         3
microsat1              3

VariantAnnotation • 1.5k views

ADD COMMENT • link 8.7 years ago kevin.rue ▴ 350

2

Entering edit mode

Thank you for noting this discrepancy. The current implementation of displaying the original header information rather than the subset is deliberate in design. We felt that knowing all the possible information available from the object could be beneficial. Also some fields are closely related and it might be useful to know one or the other existed even if it was dropped.

ADD REPLY • link 8.7 years ago shepherl 4.1k

1

Entering edit mode

Thank you. I suspected so, and it makes sense. In my use case however, I prefer to keep header and data synchronised, for a few reasons:

My package does not need to know the possible fields in the original file
As a consequence, I keep a cleaner object without unnecessary grime (no offense! just my use case ^_^)
My package can then use rownames(header(info)) or colnames(info) reciprocally without inconsistent behaviour

Should you post this reply as an answer, if there is no plan to change VariantAnnotation? (I really don't mean this in a bad way, as I wrote above, it makes sense).

For my use case, I simply wrote an extra line of code to remove from info(header) the rows without data in info(). Not a big deal.

Thanks again for the clarification!

Kevin

ADD REPLY • link 8.7 years ago kevin.rue ▴ 350