I'm using readVcf() in the VariantAnnotation package to get SNPs from a VCF file. It returns a CollapsedVCF object. I then extract the alternate allele calls using fixed(vcf)$ALT, which returns a DNAStringSetList. Each element contains a DNAStringSet with one or more alleles (there are tri-morphic SNPs in this data set). I would then like to write this to a new file, concatenating the alleles for each SNP with a comma. To be clear, I need to convert from this:
DNAStringSetList of length 6
[[1]] A
[[2]] C
[[3]] A G
[[4]] T
[[5]] A C T
[[6]] A
to this character vector:
[1] "A" "C" "A,G" "T" "A,C,T" "A"
and do it quickly for over a million SNPs. I have a slow method below, but I'm wondering if there is some slick trick that will do it more quickly. I checked the DNAStringSetList and DNAStringSet documentation and don't see a quicker way to make this conversion.
Here is sample code for 6 SNPs.
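library(VariantAnnotation)
alt = DNAStringSetList("A", "C", c("A", "G"), "T", c("A", "C", "T"), "A")
# Convert each DNAStringSet to a character vector (one R-level call per SNP)
x = lapply(alt, as.character)
# Join the alleles of each SNP with a comma
x = sapply(x, paste, collapse = ",")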
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] VariantAnnotation_1.12.9 Rsamtools_1.18.3 Biostrings_2.34.1
[4] XVector_0.6.0 GenomicRanges_1.18.4 GenomeInfoDb_1.2.5
[7] IRanges_2.0.1 S4Vectors_0.4.0 BiocGenerics_0.12.1
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.28.2 base64enc_0.1-2 BatchJobs_1.6
[4] BBmisc_1.9 Biobase_2.26.0 BiocParallel_1.0.3
[7] biomaRt_2.22.0 bitops_1.0-6 brew_1.0-6
[10] BSgenome_1.34.1 checkmate_1.5.2 codetools_0.2-11
[13] DBI_0.3.1 digest_0.6.8 fail_1.2
[16] foreach_1.4.2 GenomicAlignments_1.2.2 GenomicFeatures_1.18.7
[19] iterators_1.0.7 RCurl_1.95-4.5 RSQLite_1.0.0
[22] rtracklayer_1.26.3 sendmailR_1.2-1 stringr_0.6.2
[25] tools_3.1.1 XML_3.98-1.1 zlibbioc_1.12.0
Thanks in advance,
Daniel Gatti
The Jackson Laboratory
Hi Dan, Val,
You can use unstrsplit() to concatenate the alleles for each SNP with a comma. It should be much faster than lapply( , paste0).
Cheers,
H.
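
For anyone landing on this thread later, here is a minimal sketch of how the unstrsplit() suggestion might be applied to the 6-SNP example from the question. It rests on two assumptions that are not stated in the reply: that the DNAStringSetList can be coerced with as(alt, "CharacterList"), and that the coercion is done first because a comma is not a letter in the DNA alphabet, so joining is done on plain character data. The variable name alt_chr is only illustrative.

```r
library(Biostrings)  # provides DNAStringSetList; attaches IRanges and S4Vectors

# Same toy data as in the question
alt <- DNAStringSetList("A", "C", c("A", "G"), "T", c("A", "C", "T"), "A")

# Assumption: coercion from DNAStringSetList to CharacterList is available
# in the installed Biostrings/IRanges versions.
alt_chr <- as(alt, "CharacterList")

# unstrsplit() joins the strings within each list element in one vectorized
# call, instead of calling paste() once per SNP.
x <- unstrsplit(alt_chr, sep = ",")
x
```

If the coercion works as assumed, x should be a plain character vector with one comma-separated string per SNP, which can then be written out with writeLines() or added as a column to a table.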