Question

Annotate VCFs with Cosmic (ExAC etc.) fields

0

Entering edit mode

markus.riester ▴ 130

@markusriester-9875

Last seen 2.6 years ago

United States

Hi,

I'm trying to annotate VCFs with info fields stored in other VCFs, for example provided by Cosmic or ExAC. Since this is a very common workflow, I was wondering if I missed functionality already provided in Bioconductor.

The code below is the solution I came up with, which seems rather over-complicated. For example, is it possible to make ScanVcfParam check the ALT fields automatically?

data(package = "COSMIC.67")
data(cosmic_67, package = "COSMIC.67")
dbVcfFile <- system.file("vcf", "cosmic_67.vcf.gz", package = "COSMIC.67")

vcf <- readVcf(example.vcf.file, 'hg19')

.annotateVcf <- function(vcf, dbVcfFile, infoFields, infoTypes, infoDescriptions, infoPrefix="") {

    # make sure chromsome styles are the same
    dbSeqStyle <- seqlevelsStyle(headerTabix(
        TabixFile(dbVcfFile))$seqnames)

    vcfRenamedSL <- vcf
    if (!length(intersect(seqlevelsStyle(vcf), dbSeqStyle))) {
        seqlevelsStyle(vcfRenamedSL) <- dbSeqStyle[1]
    }

    # query variants
    db.vcf <- readVcf(dbVcfFile, genome=genome(vcf)[1],
        ScanVcfParam(which = rowRanges(vcfRenamedSL),
            info=infoFields,
            fixed="ALT",
            geno=NA
        )
    )
    ov <- match(rowRanges(vcfRenamedSL), rowRanges(db.vcf))

    # annotate header
    field <- paste0(infoPrefix,infoFields)
    newInfo <- DataFrame(
        Number=1, Type=infoTypes,
        Description=infoDescriptions,
        row.names=field)
    info(header(vcf)) <- rbind(info(header(vcf)), newInfo)

    # add info fields
    for (i in seq_along(infoFields)) {
        info(vcf)[[field[i]]] <- info(db.vcf)[[infoFields[1]]][ov]
    }

    vcf
}

vcfAnnotated <- .annotateVcf(vcf, dbVcfFile, "CNT", "Integer", "Number of Cosmic hits", "Cosmic.")

Update: updated code with suggestions.

Thank you,

Markus

variantannotation cosmic exac • 3.7k views

ADD COMMENT • link updated 8.3 years ago by Martin Morgan 25k • written 8.3 years ago by markus.riester ▴ 130

2

Entering edit mode

If you have your variants as VRanges objects, methods like match, %in%, %over% and friends also consider the alternative allele (see the help page):

'match(x)’: Like GRanges ‘match’, except matches on the combination of chromosome, start, width, and *alt*.

Personally, and just in case you are not exclusively bound to R for your entire analysis, I prefer running bcftools annotate externally for this type of task, especially when dealing with large VCF files. You can then import the annotated VCF into R/Bioc and process as normal.

ADD REPLY • link 8.3 years ago Julian Gehring ★ 1.3k

0

Entering edit mode

Thanks a lot Julian, I didn't know that "match" works on VRanges, but indeed does. And yes, the code is pretty slow for larger VCFs, is there something obvious I can do to speed it up?

ADD REPLY • link 8.3 years ago markus.riester ▴ 130

0

Entering edit mode

Avoid re-inventing the wheel, the tools doing what you want already exist. As suggested by Julian, bcftools annotate is one of those.

ADD REPLY • link 8.3 years ago WouterDeCoster ▴ 130

0

Entering edit mode

No need for patronizing. Upstream and downstream steps happen in R, so third party dependencies wouldn't be ideal either.

ADD REPLY • link 8.3 years ago markus.riester ▴ 130

0

Entering edit mode

Scanning your code, I would guess (a) if there are many ranges, then it is often faster to read-then-subset (e.g., subsetByOverlaps(db.vcf, rowRanges(vcfRenamedSL)), rather than using the which= argument to ScanVcfParam() and (b) in the for loop at the end, collect all results then do the info update (info(vcf) <- ...) only once to avoid copying the vcf object on each iteration. It can be very helpful to simply debug() and then step through the function to see where the slow points are.

Actually it seems like most of the for loop is captured without iteration as

field <- paste0(infoPrefix,infoFields)
newInfo <- DataFrame(
    Number=1, Type=infoTypes,
    Description=infoDescriptions,
    row.names=field)
info(header(vcf)) <- rbind(info(header(vcf)), newInfo)
info(vcf)[[field]] <- NA

It seems like the remaining iteration can also be made to iterate only on small data

fld = info(vcf)[[field]]
for (i in seq_along(infoFields))
    fld[queryHits(ov)] <- info(db.vcf[subjectHits(ov)])[[infoFields[i]]]
info(vcf)[[field]] = fld

and even vectorized, but it would help to see a complete (toy) example of inputs and outputs.

ADD REPLY • link 8.3 years ago Martin Morgan 25k

0

Entering edit mode

Thanks a lot Martin! I updated the code. The bottleneck is the readVcf part, but I forgot to mention that these can be very large files and I assume they are bgzipped and tabixed. It's not terribly slow, but if you see ways for improving, that would be awesome.

ADD REPLY • link 8.3 years ago markus.riester ▴ 130

score 1 · Answer 1 · 2016-11-07

I wrote annotateVcf as follows, assuming that both vcf (the vcf to be annotated) and db.vcf were already in memory.

annotateVcf <- function(vcf, db.vcf, header) {
    if (!length(intersect(seqlevelsStyle(vcf), seqlevelsStyle(db.vcf))))
        seqlevelsStyle(vcfRenamedSL) <- seqlevelsStyle(db.vcf)[1]

    info(header(vcf)) <- rbind(info(header(vcf)), header)

    idx <- match(rowRanges(vcf), rowRanges(db.vcf))
     df <- info(db.vcf)[idx, infoFields, drop=FALSE]
    names(df) <- rownames(db.vcf_header)
    info(vcf)[, names(df)] <- df
    vcf
}

Maybe the only speedup is to avoid the iteration over infoFields, so the vcf is updated only once. I used it like this

infoFields <- "CNT"
infoPrefix <- "COSMIC"
dbVcfFile <- system.file(package = "COSMIC.67", "vcf", "cosmic_67.vcf.gz")

db.vcf <- readVcf(dbVcfFile, genome=genome(vcf)[1],
                  ScanVcfParam(info=infoFields, fixed="ALT", geno=NA))
header <- info(header(db.vcf))[infoFields, , drop=FALSE]
rownames(header) <- paste(infoPrefix, rownames(header), sep=".")

example.vcf.file <- system.file(package="VariantAnnotation",
                                "extdata", "chr22.vcf.gz")
vcf <- readVcf(example.vcf.file, 'hg19')
vcf <- annotateVcf(vcf, db.vcf, header)

and to deal with large files (needing annotation) I iterated through in chunks using VcfFile(yieldSize=) and GenomicFiles::reduceByYield()

out <- tempfile()
con <- file(out, open="a")              # append
reduceByYield(
    VcfFile(example.vcf.file, yieldSize=100000),
    readVcf,
    function(vcf) {
      vcf <- annotateVcf(vcf, db.vcf, header)
      writeVcf(vcf, con)
      length(vcf)
    })
close(con)