Question

VariantAnnotation - expand not working as intended for VCF

0

Entering edit mode

sebastian.hollizeck • 0

@sebastianhollizeck-12275

Last seen 7.7 years ago

Hi,

as I am currently working on creating a new database for the variants called in our cohort, I stumbled upon the expand function in VariantAnnotation for the vcf object

expand(x, ..., row.names = FALSE): Expand (unlist) the ALT column of a CollapsedVCF object to one row per ALT value. Variables with Number='A' have one value per alternate allele and are expanded accordingly. The 'AD' genotype field is expanded into REF/ALT pairs. For all other fields, the rows are replicated to match the elementNROWS of ALT.

this is exactly what I need.
Unfortunatly I dont think it works as intended

here is what i do

>vcf <- readVcf("/archive/sample_data/sy244/sy244pa.recalibrated_variants_vep.vcf.gz", "hg19")

> rowRanges(vcf)[1942]

GRanges object with 1 range and 5 metadata columns:
                       seqnames               ranges strand | paramRangeID
                          <Rle>            <IRanges>  <Rle> |     <factor>
  chr1:46084614_C/CTTT     chr1 [46084614, 46084614]      * |         <NA>
                                  REF                ALT      QUAL      FILTER
                       <DNAStringSet> <DNAStringSetList> <numeric> <character>
  chr1:46084614_C/CTTT              C         CTTT,CTTTT   2365.77        PASS
  -------
  seqinfo: 85 sequences from hg19 genome

> geno(vcf)$GT[1942]
[1] "1/2"

> geno(vcf)$AD[1942]
[[1]]
[1]  0 25 32

So you see this entry has two alts which is reflected in all the fields

then the vcf gets expanded

> evcf <- expand(vcf)
> rowRanges(evcf)[1942]
GRanges object with 1 range and 5 metadata columns:
      seqnames               ranges strand | paramRangeID            REF
         <Rle>            <IRanges>  <Rle> |     <factor> <DNAStringSet>
  [1]     chr1 [46084614, 46084614]      * |         <NA>              C
                 ALT      QUAL      FILTER
      <DNAStringSet> <numeric> <character>
  [1]           CTTT   2365.77        PASS
  -------
  seqinfo: 85 sequences from hg19 genome

So far so good, however

> geno(evcf)$AD[1942]
[1] 0

This should clearly be 0 25

So the second part of the depth is missing. This is also true for all the other splitted variants.

AM I doing something wrong, or is this a bug?

TY for your help

also:

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] stringr_1.1.0              RPostgreSQL_0.4-1         
 [3] DBI_0.5-1                  optparse_1.3.2            
 [5] data.table_1.10.0          ensemblVEP_1.12.0         
 [7] VariantAnnotation_1.18.7   Rsamtools_1.24.0          
 [9] Biostrings_2.40.2          XVector_0.12.1            
[11] SummarizedExperiment_1.2.3 Biobase_2.32.0            
[13] GenomicRanges_1.24.3       GenomeInfoDb_1.8.7        
[15] IRanges_2.6.1              S4Vectors_0.10.3          
[17] BiocGenerics_0.18.0       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8             AnnotationDbi_1.34.4    magrittr_1.5           
 [4] GenomicAlignments_1.8.4 zlibbioc_1.18.0         getopt_1.20.0          
 [7] BiocParallel_1.6.6      BSgenome_1.40.1         tools_3.3.1            
[10] digest_0.6.11           rtracklayer_1.32.2      bitops_1.0-6           
[13] RCurl_1.95-4.8          biomaRt_2.28.0          memoise_1.0.0          
[16] RSQLite_1.1             stringi_1.1.2           GenomicFeatures_1.24.5 
[19] XML_3.98-1.5

variantannotation • 2.2k views

ADD COMMENT • link updated 7.7 years ago by Michael Lawrence ★ 11k • written 7.7 years ago by sebastian.hollizeck • 0

0

Entering edit mode

We'll need to see the VCF header and ideally the line that causes the problem.

ADD REPLY • link 7.7 years ago Michael Lawrence ★ 11k

0

Entering edit mode

> vcf
class: CollapsedVCF 
dim: 61872 1 
rowRanges(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 23 columns: AC, AF, AN, BaseQRankSum, ClippingRankSum, DP, ...
info(header(vcf)):
                       Number Type    Description                              
   AC                  A      Integer Allele count in genotypes, for each AL...
   AF                  A      Float   Allele Frequency, for each ALT allele,...
   AN                  1      Integer Total number of alleles in called geno...
   BaseQRankSum        1      Float   Z-score from Wilcoxon rank sum test of...
   ClippingRankSum     1      Float   Z-score From Wilcoxon rank sum test of...
   DP                  1      Integer Approximate read depth; some reads may...
   DS                  0      Flag    Were any of the samples downsampled?     
   END                 1      Integer Stop position of the interval            
   FS                  1      Float   Phred-scaled p-value using Fisher's ex...
   HaplotypeScore      1      Float   Consistency of the site with at most t...
   InbreedingCoeff     1      Float   Inbreeding coefficient as estimated fr...
   MLEAC               A      Integer Maximum likelihood expectation (MLE) f...
   MLEAF               A      Float   Maximum likelihood expectation (MLE) f...
   MQ                  1      Float   RMS Mapping Quality                      
   MQRankSum           1      Float   Z-score From Wilcoxon rank sum test of...
   NEGATIVE_TRAIN_SITE 0      Flag    This variant was used to build the neg...
   POSITIVE_TRAIN_SITE 0      Flag    This variant was used to build the pos...
   QD                  1      Float   Variant Confidence/Quality by Depth      
   ReadPosRankSum      1      Float   Z-score from Wilcoxon rank sum test of...
   SOR                 1      Float   Symmetric Odds Ratio of 2x2 contingenc...
   VQSLOD              1      Float   Log odds of being a true variant versu...
   culprit             1      String  The annotation which was the worst per...
   CSQ                 .      String  Consequence annotations from Ensembl V...
geno(vcf):
  SimpleList of length 10: GT, AD, DP, GQ, MIN_DP, PGT, PID, PL, RGQ, SB
geno(header(vcf)):
          Number Type    Description                                           
   GT     1      String  Genotype                                              
   AD     .      Integer Allelic depths for the ref and alt alleles in the o...
   DP     1      Integer Approximate read depth (reads with MQ=255 or with b...
   GQ     1      Integer Genotype Quality                                      
   MIN_DP 1      Integer Minimum DP observed within the GVCF block             
   PGT    1      String  Physical phasing haplotype information, describing ...
   PID    1      String  Physical phasing ID information, where each unique ...
   PL     G      Integer Normalized, Phred-scaled likelihoods for genotypes ...
   RGQ    1      Integer Unconditional reference genotype confidence, encode...
   SB     4      Integer Per-sample component statistics which comprise the ...
> header(vcf)
class: VCFHeader 
samples(1): sy244pa
meta(2): META contig
fixed(2): FILTER ALT
info(23): AC AF ... culprit CSQ
geno(10): GT AD ... RGQ SB

Do you also need the header in the file, as it is quite long?

and here the line (shorted=<...>)

bcftools view /archive/sample_data/sy244/sy244pa.recalibrated_variants_vep.vcf.gz |grep 46084614
chr1    46084614    .    C    CTTT,CTTTT    2365.77    PASS    AC=1,1;AF=0.5,0.5;AN=2;DP=59;FS=0;MLEAC=1,1;MLEAF=0.5,0.5;MQ=60;QD=29.85;SOR=1.238;VQSLOD=4.73;culprit=FS;CSQ=TTT|downstream_gene_variant<...>||||||rs35220217|691|1||insertion||HGNC|7644|CCDS55597.1|||||||||||||||||||||||||||||    GT:AD:DP:GQ:PL    1/2:0,25,32:57:99:2394,909,741,642,0,472

ADD REPLY • link 7.7 years ago sebastian.hollizeck • 0

0

Entering edit mode

The actual file header would be useful (in a pastebin or someting) so that I can construct a file to reproduce this.

ADD REPLY • link 7.7 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Here the header plus one of the lines, that show this behaviour

http://pastebin.com/f0hVctUG

ADD REPLY • link 7.7 years ago sebastian.hollizeck • 0

score 2 · Accepted Answer · 2017-02-02

2

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.8 years ago

United States

I think it's just the way you've indexed into the array. AD is stored as a 3D array, pos X sample X ref/alt. So to get all of the data for one position, use:

geno(evcf)$AD[1942,,]

ADD COMMENT • link 7.7 years ago Michael Lawrence ★ 11k

0

Entering edit mode

This is the solution, ty so much!
I just found no documentation what so ever to access expanded vcfs and expected them to work according to the normal vcfs. Especially since if I just use

geno(evcf)$AD

it returns just a one dimensional vector, which is kinda counter intuitive

ADD REPLY • link 7.7 years ago sebastian.hollizeck • 0

0

Entering edit mode

Really? I'm pretty sure it has to be an array for the code I posted to work. VCFs are complicated beasts. For example, your file is not taking advantage of the newer feature that indicates the AD field as having number 1 + alt count, so the collapsed VCF just takes the file at "face value" and stores each row as a list element (of potentially variable length). The expansion code treats AD specially and asserts the array structure (after checking that it fits).

ADD REPLY • link 7.7 years ago Michael Lawrence ★ 11k