Hi,
I'm trying to annotate VCFs with info fields stored in other VCFs, for example provided by Cosmic or ExAC. Since this is a very common workflow, I was wondering if I missed functionality already provided in Bioconductor.
The code below is the solution I came up with, which seems rather over-complicated. For example, is it possible to make ScanVcfParam check the ALT fields automatically?
data(package = "COSMIC.67") data(cosmic_67, package = "COSMIC.67") dbVcfFile <- system.file("vcf", "cosmic_67.vcf.gz", package = "COSMIC.67") vcf <- readVcf(example.vcf.file, 'hg19') .annotateVcf <- function(vcf, dbVcfFile, infoFields, infoTypes, infoDescriptions, infoPrefix="") { # make sure chromsome styles are the same dbSeqStyle <- seqlevelsStyle(headerTabix( TabixFile(dbVcfFile))$seqnames) vcfRenamedSL <- vcf if (!length(intersect(seqlevelsStyle(vcf), dbSeqStyle))) { seqlevelsStyle(vcfRenamedSL) <- dbSeqStyle[1] } # query variants db.vcf <- readVcf(dbVcfFile, genome=genome(vcf)[1], ScanVcfParam(which = rowRanges(vcfRenamedSL), info=infoFields, fixed="ALT", geno=NA ) ) ov <- match(rowRanges(vcfRenamedSL), rowRanges(db.vcf)) # annotate header field <- paste0(infoPrefix,infoFields) newInfo <- DataFrame( Number=1, Type=infoTypes, Description=infoDescriptions, row.names=field) info(header(vcf)) <- rbind(info(header(vcf)), newInfo) # add info fields for (i in seq_along(infoFields)) { info(vcf)[[field[i]]] <- info(db.vcf)[[infoFields[1]]][ov] } vcf } vcfAnnotated <- .annotateVcf(vcf, dbVcfFile, "CNT", "Integer", "Number of Cosmic hits", "Cosmic.")
Update: updated code with suggestions.
Thank you,
Markus
If you have your variants as
VRanges
objects, methods likematch
,%in%
,%over%
and friends also consider the alternative allele (see the help page):Personally, and just in case you are not exclusively bound to R for your entire analysis, I prefer running
bcftools annotate
externally for this type of task, especially when dealing with large VCF files. You can then import the annotated VCF into R/Bioc and process as normal.Thanks a lot Julian, I didn't know that "match" works on VRanges, but indeed does. And yes, the code is pretty slow for larger VCFs, is there something obvious I can do to speed it up?
Avoid re-inventing the wheel, the tools doing what you want already exist. As suggested by Julian, bcftools annotate is one of those.
No need for patronizing. Upstream and downstream steps happen in R, so third party dependencies wouldn't be ideal either.
Scanning your code, I would guess (a) if there are many ranges, then it is often faster to read-then-subset (e.g.,
subsetByOverlaps(db.vcf, rowRanges(vcfRenamedSL))
, rather than using thewhich=
argument toScanVcfParam(
) and (b) in the for loop at the end, collect all results then do the info update (info(vcf) <- ...
) only once to avoid copying the vcf object on each iteration. It can be very helpful to simply debug() and then step through the function to see where the slow points are.Actually it seems like most of the for loop is captured without iteration as
It seems like the remaining iteration can also be made to iterate only on small data
and even vectorized, but it would help to see a complete (toy) example of inputs and outputs.
Thanks a lot Martin! I updated the code. The bottleneck is the readVcf part, but I forgot to mention that these can be very large files and I assume they are bgzipped and tabixed. It's not terribly slow, but if you see ways for improving, that would be awesome.