Question

Add Custom Annotations to Multi-Sample VCF with VariantAnnotation() package

0

Entering edit mode

summerela • 0

@summerela-7378

Last seen 8.5 years ago

Seattle, WA, United States

Hi-

I'm trying to figure out the best method for adding custom annotations to
multi-sample VCF files using the VariantAnnotation package.

I have an already functionally-annotated multi-sample VCF file that I would like to
match up with another VCF file containing different annotations (not multi-sample). Where the chr, start, end, ref, and alt fields are identical, I would like to add the annotation
from the custom vcf file to the existing annotations on the multi-file.

So far I've been grappling with pulling in each vcf and coercing them into a data frame (which is really no fun with DNAStringSetList objects), and attempting to manually match up the fields. However, the ALT field can contain a comma-separated list in both fields, which makes things trickier, and I just feel that there has to be a better way. I'd really appreciate any help or feedback; would findoverlap and Vranges work better?

Here's what I've tried so far:

Data frame matching:

##############
## customvcf ##
##############

humie_ref = "hg19"

##read in vcf with custom annotations
custom <- readVcf(custom_db, humie_ref)

##rename chromosomes to match with vcf files
custom <- renameSeqlevels(custom, c("1"="chr1"))

##make data frame of needed fields
cust.df = cbind(chr= seqlevels(custom), as.data.frame(ranges(custom), stringsToFactors=FALSE),
ref=as.character(values(rowData(custom))[["REF"]]), alt=values(rowData(custom))[["ALT"]], info(custom))
##convert alts col to character list for matching
cust.df$alt.value = as.character(cust.df$alt.value)

#############
## multi-vcf ##
#############

##read in everything but sample data for speediness
vcf_param = ScanVcfParam(samples=NA)

##read in multi-sample vcf
vcf <- readVcf(multi_vcf, humie_ref, param=vcf_param)

##data frame vcf info
vcf.df = cbind(chr= seqlevels(vcf), as.data.frame(ranges(vcf), stringsToFactors=FALSE), fixed(vcf), info(vcf))

vcf.df = cbind(chr= seqlevels(vcf), as.data.frame(ranges(vcf), stringsToFactors=FALSE),
ref=as.character(values(rowData(vcf))[["REF"]]), alt=values(rowData(vcf))[["ALT"]], info(vcf))

##convert alt alleles to character for matching
vcf.df$alt.value = as.character(vcf.df$alt.value)

###############
## match-vcf ##
###############
## Here's where I get a little stuck because no matter what I try, I end up with things like "AA|T" matching up to "G", etc.

##subset for all fields except alt
match1 = data.frame(merge(vcf.df, kav.df, by=c("chr", "start", "end", "ref")))

##keep matches with same alt allele
matches = unique(match1[unlist(sapply(match1$ALT.x, grep, match1$ALT.y, fixed=TRUE)),])

Please let me know if you'd also like me to include example output.

Thanks,
Summer

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] BiocInstaller_1.16.1 data.table_1.9.4 VariantAnnotation_1.12.9 Rsamtools_1.18.2
[5] Biostrings_2.34.1 XVector_0.6.0 GenomicRanges_1.18.4 GenomeInfoDb_1.2.4
[9] IRanges_2.0.1 S4Vectors_0.4.0 BiocGenerics_0.12.1

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.28.1 base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.9
[5] Biobase_2.26.0 BiocParallel_1.0.3 biomaRt_2.22.0 bitops_1.0-6
[9] brew_1.0-6 BSgenome_1.34.1 checkmate_1.5.1 chron_2.3-45
[13] codetools_0.2-10 DBI_0.3.1 digest_0.6.8 fail_1.2
[17] foreach_1.4.2 GenomicAlignments_1.2.1 GenomicFeatures_1.18.3 iterators_1.0.7
[21] packrat_0.4.3 plyr_1.8.1 Rcpp_0.11.3 RCurl_1.95-4.5
[25] reshape2_1.4 RSQLite_1.0.0 rtracklayer_1.26.2 sendmailR_1.2-1
[29] stringr_0.6.2 tools_3.1.2 XML_3.98-1.1 zlibbioc_1.12.0

variantannotation granges DNAStringSetList findoverlaps Vranges • 3.6k views

ADD COMMENT • link 10.1 years ago summerela • 0

score 2 · Accepted Answer · 2015-02-18

2

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 3.2 years ago

United States

Hi Summer,

I think you want one ALT per row in your data frame so the vcf data needs to be expanded. Take a look at ?expand on a VCF object or try reading the data in with ?readVcfAsVRanges.

Once you've created your merged data frame (match1) compare the ALT fields directly:

match1$ALT.x == match1$ALT.y

This will give you a logical vector that you can subset on.

As a side note, DNAStringSet objects can be compared directly and you don't need to coerce to character for comparison. Here's an example of subsetting a VCF object on matching alleles:

Read in vcf files:

fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
vcf1 <- vcf2 <- readVcf(fl, "hg19")

Change ALT in vcf2 for sake of example:

> alt(vcf2)
DNAStringSetList of length 5
[[1]] A
[[2]] A
[[3]] G T
[[4]] 
[[5]] G GTCT

> alt(vcf2)[[4]] <- "TT"

> alt(vcf2)
DNAStringSetList of length 5
[[1]] A
[[2]] A
[[3]] G T
[[4]] TT
[[5]] G GTCT

Compares every ALT in the list:

> alt(vcf1) == alt(vcf2)
LogicalList of length 5
[[1]] TRUE
[[2]] TRUE
[[3]] TRUE TRUE
[[4]] FALSE
[[5]] TRUE TRUE

Return TRUE only if all match:

> idx <- all(alt(vcf1) == alt(vcf2))
> idx
[1]  TRUE  TRUE  TRUE FALSE  TRUE

Subset the VCF:

> dim(vcf1)
[1] 5 3
> dim(vcf1[idx, ])
[1] 4 3

Valerie

ADD COMMENT • link 10.1 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Valerie-

I can't thank you enough! I ended up using "any" instead of "all" to subset the vcf files so that I could keep rows where any of the alleles matched.

> ##read in test files
> fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
> vcf1 <- vcf2 <- readVcf(fl, "hg19")
> alt(vcf1)[[5]] <- "GTCT"
> ##read in test files
> fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
> vcf1 <- vcf2 <- readVcf(fl, "hg19")
> 
> ##change allele for example
> alt(vcf1)[[5]] <- "GTCT"
> 
> ##view alt alleles
> alt(vcf1)
DNAStringSetList of length 5
[[1]] A
[[2]] A
[[3]] G T
[[4]] 
[[5]] GTCT
> alt(vcf2)
DNAStringSetList of length 5
[[1]] A
[[2]] A
[[3]] G T
[[4]] 
[[5]] G GTCT
> 
> ##compare
> alt(vcf1) == alt(vcf2)
LogicalList of length 5
[[1]] TRUE
[[2]] TRUE
[[3]] TRUE TRUE
[[4]] TRUE
[[5]] FALSE TRUE
> 
> ##subset for rows where at least one allele matches
> idx <- any(alt(vcf1) == alt(vcf2))
> idx
[1] TRUE TRUE TRUE TRUE TRUE

ADD REPLY • link 10.1 years ago summerela • 0

0

Entering edit mode

I spoke too soon!

Using the method of comparing alt(vcf) and alt(ref) directly on compressed vcf files returns an error when trying to subset the vcf file, I'm assuming because the vectors are of differing lengths.

I tried the as below, and also tried first creating a subset of the vcf's by merging on chr,start,end,ref. Both methods generated the same error message:

> ##read in edited kaviar vcf and human ref
> ref <- readVcf("/Documents/snp_annotato.R/ref_db/kav_edit.vcf.gz", humie_ref)
> 
> ##read in everything but sample data for speediness
> vcf_param = ScanVcfParam(samples=NA)
> vcf_reg <- readVcf("./ref_db/cg_test.vcf.gz", humie_ref, param=vcf_param)
> 
> ##subset for matching alt alleles
> vcf_reg[any(alt(vcf_reg) == alt(ref)),]
Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) : 
  subscript is a logical vector with out-of-bounds TRUE values
In addition: Warning message:
In .Method(..., FUN = FUN, MoreArgs = MoreArgs, SIMPLIFY = SIMPLIFY,  :
  longer argument not a multiple of length of shorter

I also tried reading in the vcf files as Vranges, and came up with the strange matches highlighted below:

> ##read in edited ref vcf 
> kaviar <- readVcfAsVRanges("ref.vcf.gz", humie_ref)
Warning message:
In .vcf_usertag(map, tag, ...) :
  ScanVcfParam ‘geno’ fields not present: ‘AD’
> 
> ##rename chromosomes to match with vcf files
> ref <- renameSeqlevels(ref, c("1"="chr1"))
> 
> ##################################
> ## Gather VCF files to process  ## 
> ##################################
> ##data frame *.vcf.gz files in directory path
> vcf_path <- data.frame(path=list.files(vcf_dir, pattern="*.vcf.gz$", full=TRUE))
> 
> ##read in everything but sample data for speediness
> vcf_param = ScanVcfParam(samples=NA)
> vcf <- readVcfAsVRanges(vcf_path[1], humie_ref, param=vcf_param)
> 
> #################
> ## Match SNP's ##
> #################
> ##create data frames of info to match on
> vcf.df = data.frame(chr =as.character(seqnames(vcf)), start = start(vcf), end = end(vcf), ref = as.character(ref(vcf)), 
+                     alt=alt(vcf), stringsAsFactors=FALSE)
> ref.df = data.frame(chr =as.character(seqnames(ref)), start = start(ref), end = end(ref), 
+                     ref = as.character(ref(ref)), alt=alt(ref), stringsAsFactors=FALSE)
> 
> ##merge based on all position fields except vcf
> col_match = data.frame(merge(vcf.df, ref.df, by=c("chr", "start", "end", "ref")))
> 
> ##subset col_match by matching alt alleles
> col_match[any(col_match$alt.x == col_match$alt.y),]
    chr    start      end    ref  alt.x  alt.y
1  chr1 39998059 39998059      A      G      G
2  chr1 39998059 39998059      A      G      G
3  chr1 39998084 39998084      C      A      A
4  chr1 39998084 39998084      C      A      A
5  chr1 39998085 39998085      G      A      A
6  chr1 39998085 39998085      G      A      A
7  chr1 39998223 39998223      T      G      G
8  chr1 39998223 39998223      T      G      G
9  chr1 39998238 39998243 TGCGCG CGGGAA CGGGAA
10 chr1 39998238 39998243 TGCGCG CGGGAA CGGGAA
11 chr1 39998238 39998243 TGCGCG TGCACG CGGGAA
12 chr1 39998238 39998243 TGCGCG TGCACG CGGGAA
13 chr1 39998250 39998250      A      G      G
14 chr1 39998250 39998250      A      G      G
15 chr1 39998335 39998336     CG     TA     TA
16 chr1 39998335 39998336     CG     TA     TA
33 chr1 39998416 39998416      G      A      A
34 chr1 39998416 39998416      G      A      A
35 chr1 39998421 39998423    CGT    CGG    TGC
36 chr1 39998421 39998423    CGT    CGG    TGC
37 chr1 39998421 39998423    CGT    TGC    TGC
38 chr1 39998421 39998423    CGT    TGC    TGC

ADD REPLY • link 10.1 years ago summerela • 0

1

Entering edit mode

The direct comparison of alt values between 2 vcf files only makes sense if the vcf files are the same length and have the same positions (so we are comparing apples with apples). Creating the merged data frame by position handles the case where the vcf files have different positions.

You said you got the same error (Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) : ....) with the merged data frame. What do the alt.x and alt.y values in your data frame look like? When you do the '==' comparison between the alts, what is the length of the resulting logical vector? Is it a vector or List object? Looking at these answers should help you understand where things went wrong.

In the VRanges code it looks like you've read in 'vcf' and 'kaviar' with readVcfAsVRanges(). You then create a data frame from 'vcf' and 'ref' but 'ref' was read in with readVcf() and not readVcfAsVRanges(). This means the data in 'ref' is not expanded to one alt per row.

Valerie

ADD REPLY • link 10.1 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Valerie-

I was trying to show two methods above of reading the same file in with readVcf and readVcfAsVRanges, sorry if I did make that very obvious!

You are absolutely right about the direct comparison of 2 VCF files. Here is the final code I came up with. Now if I can just get the info field imported as mentioned in my other post, which I'll link here in case it's helpful to anyone, (https://support.bioconductor.org/p/65074/#65141), this will be good to go! Exciting! Thanks so much for your help.

##list required packages
##installation issues = install through biocLite()
packages <- c("VariantAnnotation", "IRanges", "stringi")
##create function to install and/or load packages 
load_reqs <- function(lib_arg){
  if (require(lib_arg, character.only=TRUE)== TRUE){
    library(lib_arg, character.only=TRUE)
  } else {
    source("http://www.bioconductor.org/biocLite.R") 
    biocLite(lib_arg) 
  }
}
##apply function to list of required packages
lapply(packages, load_reqs)

> #########################
> ## Read in Kaviar File ##
> #########################
> ##RUNONCE only use when updated kaviar db
> ##replace <CN0> with a period so we dont' break R
> #zcat < Kav_ranged.vcf.gz| awk -F $'\t' 'BEGIN {OFS = FS} {gsub("\<CN0\>", ".", $5);print}' | bgzip > kav_edit.vcf.gz
> 
> ##set reference 
> humie_ref = "hg19"
> 
> ##read in kaviar vcf as VRange object
> kaviar <- readVcfAsVRanges("/Documents/snp_annotato.R/ref_db/Kav_ranged.vcf.gz", humie_ref)
Warning message:
In .vcf_usertag(map, tag, ...) :
  ScanVcfParam ‘geno’ fields not present: ‘AD’
> 
> ##rename chromosomes to match with vcf files
> kaviar <- renameSeqlevels(kaviar, c("1"="chr1"))
> 
> ##################################
> ## Gather VCF files to process  ## 
> ##################################
> ##data frame *.vcf.gz files in directory path
> #vcf_path <- data.frame(path=list.files(vcf_dir, pattern="*.vcf.gz$", full=TRUE))
> 
> ##read in everything but sample data for speediness
> vcf_param = ScanVcfParam(samples=NA)
> vcf <- readVcfAsVRanges("./snp_annotato.R/ref_db/cg_test.vcf.gz", humie_ref, param=vcf_param)
> #################
> ## Match SNP's ##
> #################
> ##create data frames of info to match on
> vcf.df = data.frame(chr =as.character(seqnames(vcf)), start = start(vcf), end = end(vcf), ref = as.character(ref(vcf)), 
+                     alt=alt(vcf), stringsAsFactors=FALSE)
> kav.df = data.frame(chr =as.character(seqnames(kaviar)), start = start(kaviar), end = end(kaviar), 
+                     ref = as.character(ref(kaviar)), alt=alt(kaviar), stringsAsFactors=FALSE)
> 
> ##merge based on all positional fields except vcf
> col_match = data.frame(merge(vcf.df, kav.df, by=c("chr", "start", "end", "ref")))
> ##split each alt column by comma and bind together
> M1 <- stri_list2matrix(sapply(col_match$alt.x,strsplit,','))
> M2 <- stri_list2matrix(sapply(col_match$alt.y,strsplit,','))
> M <- rbind(M1,M2)
> 
> ##compare results
> result <- apply(M,2,function(z) unique(na.omit(z[duplicated(z)])))
> 
> ##add results column to col_match df for checking/subsetting
> col_match$match = result
> 
> col_match
    chr    start      end    ref  alt.x  alt.y  match
1  chr2 39998078 39998999      A      G      G      G
2  chr2 39998555 39998556      A      G      G      G
3  chr2 39998999 40000000      C      A      A      A

ADD REPLY • link 10.1 years ago summerela • 0