Hi,
as I am currently working on creating a new database for the variants called in our cohort, I stumbled upon the expand function in VariantAnnotation for the vcf object
expand(x, ..., row.names = FALSE): Expand (unlist) the ALT column of a CollapsedVCF object to one row per ALT value. Variables with Number='A' have one value per alternate allele and are expanded accordingly. The 'AD' genotype field is expanded into REF/ALT pairs. For all other fields, the rows are replicated to match the elementNROWS of ALT.
this is exactly what I need.
Unfortunatly I dont think it works as intended
here is what i do
>vcf <- readVcf("/archive/sample_data/sy244/sy244pa.recalibrated_variants_vep.vcf.gz", "hg19") > rowRanges(vcf)[1942] GRanges object with 1 range and 5 metadata columns: seqnames ranges strand | paramRangeID <Rle> <IRanges> <Rle> | <factor> chr1:46084614_C/CTTT chr1 [46084614, 46084614] * | <NA> REF ALT QUAL FILTER <DNAStringSet> <DNAStringSetList> <numeric> <character> chr1:46084614_C/CTTT C CTTT,CTTTT 2365.77 PASS ------- seqinfo: 85 sequences from hg19 genome > geno(vcf)$GT[1942] [1] "1/2" > geno(vcf)$AD[1942] [[1]] [1] 0 25 32
So you see this entry has two alts which is reflected in all the fields
then the vcf gets expanded
> evcf <- expand(vcf) > rowRanges(evcf)[1942] GRanges object with 1 range and 5 metadata columns: seqnames ranges strand | paramRangeID REF <Rle> <IRanges> <Rle> | <factor> <DNAStringSet> [1] chr1 [46084614, 46084614] * | <NA> C ALT QUAL FILTER <DNAStringSet> <numeric> <character> [1] CTTT 2365.77 PASS ------- seqinfo: 85 sequences from hg19 genome So far so good, however > geno(evcf)$AD[1942] [1] 0
This should clearly be 0 25
So the second part of the depth is missing. This is also true for all the other splitted variants.
AM I doing something wrong, or is this a bug?
TY for your help
also:
> sessionInfo() R version 3.3.1 (2016-06-21) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] stringr_1.1.0 RPostgreSQL_0.4-1 [3] DBI_0.5-1 optparse_1.3.2 [5] data.table_1.10.0 ensemblVEP_1.12.0 [7] VariantAnnotation_1.18.7 Rsamtools_1.24.0 [9] Biostrings_2.40.2 XVector_0.12.1 [11] SummarizedExperiment_1.2.3 Biobase_2.32.0 [13] GenomicRanges_1.24.3 GenomeInfoDb_1.8.7 [15] IRanges_2.6.1 S4Vectors_0.10.3 [17] BiocGenerics_0.18.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.8 AnnotationDbi_1.34.4 magrittr_1.5 [4] GenomicAlignments_1.8.4 zlibbioc_1.18.0 getopt_1.20.0 [7] BiocParallel_1.6.6 BSgenome_1.40.1 tools_3.3.1 [10] digest_0.6.11 rtracklayer_1.32.2 bitops_1.0-6 [13] RCurl_1.95-4.8 biomaRt_2.28.0 memoise_1.0.0 [16] RSQLite_1.1 stringi_1.1.2 GenomicFeatures_1.24.5 [19] XML_3.98-1.5
We'll need to see the VCF header and ideally the line that causes the problem.
Do you also need the header in the file, as it is quite long?
and here the line (shorted=<...>)
The actual file header would be useful (in a pastebin or someting) so that I can construct a file to reproduce this.
Here the header plus one of the lines, that show this behaviour
http://pastebin.com/f0hVctUG