Hi-
I'm trying to figure out the best method for adding custom annotations to
multi-sample VCF files using the VariantAnnotation package.
I have an already functionally-annotated multi-sample VCF file that I would like to
match up with another VCF file containing different annotations (not multi-sample). Where the chr, start, end, ref, and alt fields are identical, I would like to add the annotation
from the custom vcf file to the existing annotations on the multi-file.
So far I've been grappling with pulling in each vcf and coercing them into a data frame (which is really no fun with DNAStringSetList objects), and attempting to manually match up the fields. However, the ALT field can contain a comma-separated list in both fields, which makes things trickier, and I just feel that there has to be a better way. I'd really appreciate any help or feedback; would findoverlap and Vranges work better?
Here's what I've tried so far:
Data frame matching:
##############
## customvcf ##
##############
humie_ref = "hg19"
##read in vcf with custom annotations
custom <- readVcf(custom_db, humie_ref)
##rename chromosomes to match with vcf files
custom <- renameSeqlevels(custom, c("1"="chr1"))
##make data frame of needed fields
cust.df = cbind(chr= seqlevels(custom), as.data.frame(ranges(custom), stringsToFactors=FALSE),
ref=as.character(values(rowData(custom))[["REF"]]), alt=values(rowData(custom))[["ALT"]], info(custom))
##convert alts col to character list for matching
cust.df$alt.value = as.character(cust.df$alt.value)
#############
## multi-vcf ##
#############
##read in everything but sample data for speediness
vcf_param = ScanVcfParam(samples=NA)
##read in multi-sample vcf
vcf <- readVcf(multi_vcf, humie_ref, param=vcf_param)
##data frame vcf info
vcf.df = cbind(chr= seqlevels(vcf), as.data.frame(ranges(vcf), stringsToFactors=FALSE), fixed(vcf), info(vcf))
vcf.df = cbind(chr= seqlevels(vcf), as.data.frame(ranges(vcf), stringsToFactors=FALSE),
ref=as.character(values(rowData(vcf))[["REF"]]), alt=values(rowData(vcf))[["ALT"]], info(vcf))
##convert alt alleles to character for matching
vcf.df$alt.value = as.character(vcf.df$alt.value)
###############
## match-vcf ##
###############
## Here's where I get a little stuck because no matter what I try, I end up with things like "AA|T" matching up to "G", etc.
##subset for all fields except alt
match1 = data.frame(merge(vcf.df, kav.df, by=c("chr", "start", "end", "ref")))
##keep matches with same alt allele
matches = unique(match1[unlist(sapply(match1$ALT.x, grep, match1$ALT.y, fixed=TRUE)),])
Please let me know if you'd also like me to include example output.
Thanks,
Summer
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] BiocInstaller_1.16.1 data.table_1.9.4 VariantAnnotation_1.12.9 Rsamtools_1.18.2
[5] Biostrings_2.34.1 XVector_0.6.0 GenomicRanges_1.18.4 GenomeInfoDb_1.2.4
[9] IRanges_2.0.1 S4Vectors_0.4.0 BiocGenerics_0.12.1
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.28.1 base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.9
[5] Biobase_2.26.0 BiocParallel_1.0.3 biomaRt_2.22.0 bitops_1.0-6
[9] brew_1.0-6 BSgenome_1.34.1 checkmate_1.5.1 chron_2.3-45
[13] codetools_0.2-10 DBI_0.3.1 digest_0.6.8 fail_1.2
[17] foreach_1.4.2 GenomicAlignments_1.2.1 GenomicFeatures_1.18.3 iterators_1.0.7
[21] packrat_0.4.3 plyr_1.8.1 Rcpp_0.11.3 RCurl_1.95-4.5
[25] reshape2_1.4 RSQLite_1.0.0 rtracklayer_1.26.2 sendmailR_1.2-1
[29] stringr_0.6.2 tools_3.1.2 XML_3.98-1.1 zlibbioc_1.12.0
Valerie-
I can't thank you enough! I ended up using "any" instead of "all" to subset the vcf files so that I could keep rows where any of the alleles matched.
I spoke too soon!
Using the method of comparing alt(vcf) and alt(ref) directly on compressed vcf files returns an error when trying to subset the vcf file, I'm assuming because the vectors are of differing lengths.
I tried the as below, and also tried first creating a subset of the vcf's by merging on chr,start,end,ref. Both methods generated the same error message:
I also tried reading in the vcf files as Vranges, and came up with the strange matches highlighted below:
> ##read in edited ref vcf > kaviar <- readVcfAsVRanges("ref.vcf.gz", humie_ref) Warning message: In .vcf_usertag(map, tag, ...) : ScanVcfParam ‘geno’ fields not present: ‘AD’ > > ##rename chromosomes to match with vcf files > ref <- renameSeqlevels(ref, c("1"="chr1")) > > ################################## > ## Gather VCF files to process ## > ################################## > ##data frame *.vcf.gz files in directory path > vcf_path <- data.frame(path=list.files(vcf_dir, pattern="*.vcf.gz$", full=TRUE)) > > ##read in everything but sample data for speediness > vcf_param = ScanVcfParam(samples=NA) > vcf <- readVcfAsVRanges(vcf_path[1], humie_ref, param=vcf_param) > > ################# > ## Match SNP's ## > ################# > ##create data frames of info to match on > vcf.df = data.frame(chr =as.character(seqnames(vcf)), start = start(vcf), end = end(vcf), ref = as.character(ref(vcf)), + alt=alt(vcf), stringsAsFactors=FALSE) > ref.df = data.frame(chr =as.character(seqnames(ref)), start = start(ref), end = end(ref), + ref = as.character(ref(ref)), alt=alt(ref), stringsAsFactors=FALSE) > > ##merge based on all position fields except vcf > col_match = data.frame(merge(vcf.df, ref.df, by=c("chr", "start", "end", "ref"))) > > ##subset col_match by matching alt alleles > col_match[any(col_match$alt.x == col_match$alt.y),] chr start end ref alt.x alt.y 1 chr1 39998059 39998059 A G G 2 chr1 39998059 39998059 A G G 3 chr1 39998084 39998084 C A A 4 chr1 39998084 39998084 C A A 5 chr1 39998085 39998085 G A A 6 chr1 39998085 39998085 G A A 7 chr1 39998223 39998223 T G G 8 chr1 39998223 39998223 T G G 9 chr1 39998238 39998243 TGCGCG CGGGAA CGGGAA 10 chr1 39998238 39998243 TGCGCG CGGGAA CGGGAA 11 chr1 39998238 39998243 TGCGCG TGCACG CGGGAA 12 chr1 39998238 39998243 TGCGCG TGCACG CGGGAA 13 chr1 39998250 39998250 A G G 14 chr1 39998250 39998250 A G G 15 chr1 39998335 39998336 CG TA TA 16 chr1 39998335 39998336 CG TA TA 33 chr1 39998416 39998416 G A A 34 chr1 39998416 39998416 G A A 35 chr1 39998421 39998423 CGT CGG TGC 36 chr1 39998421 39998423 CGT CGG TGC 37 chr1 39998421 39998423 CGT TGC TGC 38 chr1 39998421 39998423 CGT TGC TGC
The direct comparison of alt values between 2 vcf files only makes sense if the vcf files are the same length and have the same positions (so we are comparing apples with apples). Creating the merged data frame by position handles the case where the vcf files have different positions.
You said you got the same error (Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) : ....) with the merged data frame. What do the alt.x and alt.y values in your data frame look like? When you do the '==' comparison between the alts, what is the length of the resulting logical vector? Is it a vector or List object? Looking at these answers should help you understand where things went wrong.
In the VRanges code it looks like you've read in 'vcf' and 'kaviar' with readVcfAsVRanges(). You then create a data frame from 'vcf' and 'ref' but 'ref' was read in with readVcf() and not readVcfAsVRanges(). This means the data in 'ref' is not expanded to one alt per row.
Valerie
Hi Valerie-
I was trying to show two methods above of reading the same file in with readVcf and readVcfAsVRanges, sorry if I did make that very obvious!
You are absolutely right about the direct comparison of 2 VCF files. Here is the final code I came up with. Now if I can just get the info field imported as mentioned in my other post, which I'll link here in case it's helpful to anyone, (https://support.bioconductor.org/p/65074/#65141), this will be good to go! Exciting! Thanks so much for your help.