Question

Complementing alleles in a VCF

0

Entering edit mode

Alex Gutteridge ▴ 650

@alex-gutteridge-2935

Last seen 10.2 years ago

United States

Hi All, I'm having some difficulty with VCF handling and feeding 1000 genome VCFs into predictCoding from the VariantAnnotation package. Can anyone help? When I read 1000 genomes VCFs the SNPs come through with no strand specified e.g (where vcf.file and param are a 1000 genomes VCF file and a GRanges respectively): > vcf = readVcf(vcf.file, "hg19", param) > fixed(vcf) GRanges with 2610 ranges and 5 elementMetadata cols: seqnames ranges strand | paramRangeID <rle> <iranges> <rle> | <factor> rs186838828 chr2 [167051756, 167051756] * | SCN9A rs191667986 chr2 [167051865, 167051865] * | SCN9A rs139483482 chr2 [167051900, 167051900] * | SCN9A rs182687583 chr2 [167052080, 167052080] * | SCN9A rs115766730 chr2 [167052144, 167052144] * | SCN9A rs186613025 chr2 [167052168, 167052168] * | SCN9A rs73017538 chr2 [167052328, 167052328] * | SCN9A rs114327563 chr2 [167052375, 167052375] * | SCN9A rs191401619 chr2 [167052418, 167052418] * | SCN9A ... ... ... ... ... ... rs73025590 chr2 [167231750, 167231750] * | SCN9A rs141453198 chr2 [167231812, 167231812] * | SCN9A rs16852069 chr2 [167231890, 167231890] * | SCN9A rs181276399 chr2 [167231932, 167231932] * | SCN9A rs185839773 chr2 [167232251, 167232251] * | SCN9A rs191091185 chr2 [167232439, 167232439] * | SCN9A rs148362057 chr2 [167232446, 167232446] * | SCN9A rs141521157 chr2 [167232450, 167232450] * | SCN9A rs1881440 chr2 [167232463, 167232463] * | SCN9A REF ALT QUAL FILTER <dnastringset> <dnastringsetlist> <numeric> <character> rs186838828 T ######## 100 PASS rs191667986 T ######## 100 PASS rs139483482 T ######## 100 PASS rs182687583 G ######## 100 PASS rs115766730 G ######## 100 PASS rs186613025 A ######## 100 PASS rs73017538 A ######## 100 PASS rs114327563 A ######## 100 PASS rs191401619 C ######## 100 PASS ... ... ... ... ... rs73025590 A ######## 100 PASS rs141453198 C ######## 100 PASS rs16852069 A ######## 100 PASS rs181276399 G ######## 100 PASS rs185839773 A ######## 100 PASS rs191091185 C ######## 100 PASS rs148362057 A ######## 100 PASS rs141521157 A ######## 100 PASS rs1881440 C ######## 100 PASS --- seqlengths: chr2 NA When I feed these to predictCoding(), genes on the complement strand are not dealt with correctly - the ALT alleles given in the VCF are not complemented first, so the variant codons are not translated right. > txdb = TxDb.Hsapiens.UCSC.hg19.knownGene > aa = predictCoding(vcf.filt, txdb, Hsapiens) It seems like predictCoding should be able to deal with genes on the complement strand, but maybe it requires the strand to be set correctly in the VCF? I've tried and failed to find a way of editing the VCF - if nothing else to just complement the ALT alleles before running predictcoding, but it doesn't seem to work. Though the manual seems to suggest it is possible. I didn't find a way to actually set the strand directly either: ?alt(x)?, ?alt(x) <- value?: Returns or sets the alternate allele data from the ALT column of the VCF file. ?value? can be a ?DNAStringSet? or a ?CharacterList? (for a structural VCF file). > tmp.alt = complement(unlist(values(alt(vcf))[["ALT"]])) > tmp.alt A DNAStringSet instance of length 2610 width seq [1] 1 T [2] 1 G [3] 1 G [4] 1 T [5] 1 T [6] 1 C [7] 1 C [8] 1 C [9] 1 C ... ... ... [2602] 1 A [2603] 1 T [2604] 1 C [2605] 1 T [2606] 1 C [2607] 1 C [2608] 1 C [2609] 1 C [2610] 1 T > alt(vcf) = tmp.alt Error in function (classes, fdef, mtable) : unable to find an inherited method for function "alt<-", for signature "VCF", "DNAStringSet" > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] RColorBrewer_1.0-5 [2] caTools_1.12 [3] bitops_1.0-4.1 [4] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1 [5] GenomicFeatures_1.8.1 [6] org.Hs.eg.db_2.7.1 [7] RSQLite_0.11.1 [8] DBI_0.2-5 [9] AnnotationDbi_1.18.0 [10] Biobase_2.16.0 [11] VariantAnnotation_1.2.5 [12] Rsamtools_1.8.3 [13] BSgenome.Hsapiens.UCSC.hg19_1.3.17 [14] BSgenome_1.24.0 [15] Biostrings_2.24.1 [16] GenomicRanges_1.8.3 [17] IRanges_1.14.2 [18] BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] biomaRt_2.12.0 grid_2.15.0 lattice_0.20-6 Matrix_1.0-6 [5] RCurl_1.91-1 rtracklayer_1.16.1 snpStats_1.6.0 splines_2.15.0 [9] stats4_2.15.0 survival_2.36-12 tools_2.15.0 XML_3.9-4 [13] zlibbioc_1.2.0 -- Alex Gutteridge

VariantAnnotation BSgenome BSgenome genomes VariantAnnotation VariantAnnotation BSgenome • 1.4k views

ADD COMMENT • link updated 12.6 years ago by Martin Morgan 25k • written 12.6 years ago by Alex Gutteridge ▴ 650

score 0 · Answer 1 · 2012-04-19

Hi Alex -- On 04/19/2012 02:39 AM, Alex Gutteridge wrote: > Hi All, > > I'm having some difficulty with VCF handling and feeding 1000 genome > VCFs into predictCoding from the VariantAnnotation package. Can anyone > help? > > When I read 1000 genomes VCFs the SNPs come through with no strand > specified e.g (where vcf.file and param are a 1000 genomes VCF file and > a GRanges respectively): > >> vcf = readVcf(vcf.file, "hg19", param) >> fixed(vcf) > GRanges with 2610 ranges and 5 elementMetadata cols: > seqnames ranges strand | paramRangeID > <rle> <iranges> <rle> | <factor> > rs186838828 chr2 [167051756, 167051756] * | SCN9A > rs191667986 chr2 [167051865, 167051865] * | SCN9A > rs139483482 chr2 [167051900, 167051900] * | SCN9A > rs182687583 chr2 [167052080, 167052080] * | SCN9A > rs115766730 chr2 [167052144, 167052144] * | SCN9A > rs186613025 chr2 [167052168, 167052168] * | SCN9A > rs73017538 chr2 [167052328, 167052328] * | SCN9A > rs114327563 chr2 [167052375, 167052375] * | SCN9A > rs191401619 chr2 [167052418, 167052418] * | SCN9A > ... ... ... ... ... ... > rs73025590 chr2 [167231750, 167231750] * | SCN9A > rs141453198 chr2 [167231812, 167231812] * | SCN9A > rs16852069 chr2 [167231890, 167231890] * | SCN9A > rs181276399 chr2 [167231932, 167231932] * | SCN9A > rs185839773 chr2 [167232251, 167232251] * | SCN9A > rs191091185 chr2 [167232439, 167232439] * | SCN9A > rs148362057 chr2 [167232446, 167232446] * | SCN9A > rs141521157 chr2 [167232450, 167232450] * | SCN9A > rs1881440 chr2 [167232463, 167232463] * | SCN9A > REF ALT QUAL FILTER > <dnastringset> <dnastringsetlist> <numeric> <character> > rs186838828 T ######## 100 PASS > rs191667986 T ######## 100 PASS > rs139483482 T ######## 100 PASS > rs182687583 G ######## 100 PASS > rs115766730 G ######## 100 PASS > rs186613025 A ######## 100 PASS > rs73017538 A ######## 100 PASS > rs114327563 A ######## 100 PASS > rs191401619 C ######## 100 PASS > ... ... ... ... ... > rs73025590 A ######## 100 PASS > rs141453198 C ######## 100 PASS > rs16852069 A ######## 100 PASS > rs181276399 G ######## 100 PASS > rs185839773 A ######## 100 PASS > rs191091185 C ######## 100 PASS > rs148362057 A ######## 100 PASS > rs141521157 A ######## 100 PASS > rs1881440 C ######## 100 PASS > --- > seqlengths: > chr2 > NA > > When I feed these to predictCoding(), genes on the complement strand are > not dealt with correctly - the ALT alleles given in the VCF are not > complemented first, so the variant codons are not translated right. A definitive answer won't be available until next week. Probably a more straight-forward way, but... > >> txdb = TxDb.Hsapiens.UCSC.hg19.knownGene >> aa = predictCoding(vcf.filt, txdb, Hsapiens) > > It seems like predictCoding should be able to deal with genes on the > complement strand, but maybe it requires the strand to be set correctly > in the VCF? I've tried and failed to find a way of editing the VCF - if > nothing else to just complement the ALT alleles before running > predictcoding, but it doesn't seem to work. Though the manual seems to > suggest it is possible. I didn't find a way to actually set the strand > directly either: > > ?alt(x)?, ?alt(x) <- value?: Returns or sets the alternate allele data > from the ALT column of the VCF file. ?value? can be a > ?DNAStringSet? or a ?CharacterList? (for a structural VCF > file). > >> tmp.alt = complement(unlist(values(alt(vcf))[["ALT"]])) >> tmp.alt > A DNAStringSet instance of length 2610 > width seq > [1] 1 T > [2] 1 G > [3] 1 G > [4] 1 T > [5] 1 T > [6] 1 C > [7] 1 C > [8] 1 C > [9] 1 C > ... ... ... > [2602] 1 A > [2603] 1 T > [2604] 1 C > [2605] 1 T > [2606] 1 C > [2607] 1 C > [2608] 1 C > [2609] 1 C > [2610] 1 T >> alt(vcf) = tmp.alt > Error in function (classes, fdef, mtable) : > unable to find an inherited method for function "alt<-", for signature > "VCF", "DNAStringSet" as an example fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") vcf <- readVcf(fl, "hg19") you can set the strand on vcf with > strand(rowData(vcf)) 'factor' Rle of length 5 with 1 run Lengths: 5 Values : * Levels(3): + - * > strand(rowData(vcf)) <- "+" > strand(rowData(vcf)) 'factor' Rle of length 5 with 1 run Lengths: 5 Values : + Levels(3): + - * (probably there should be strand,VCF, and strand<-,VCF,ANY methods). For alt, it seems like DNAStringSetList needs a constructor DNAStringSetList <- function(...) { listData <- list(...) if (length(listData) == 1 && is.list(listData[[1L]])) listData <- listData[[1L]] ok <- sapply(listData, is, "DNAStringSet") if (!all(ok)) listData <- lapply(listData, as, "DNAStringSet") IRanges:::newCompressedList("DNAStringSetList", listData) } and then > orig <- values(alt(vcf))[["ALT"]] > dna <- complement(unlist(orig)) > alt(vcf) <- relist(dna, orig) (It's confusing, though convenient, how alt() returns a GRanges, but alt<- expects 'value' to be a DNAStringSetList; also as you note the docs say there should be a VCF,DNAStringSet method for alt<-, but there doesn't seem to be one (it wouldn't be appropriate in your case, because ALT is a DNAStringSetList). Martin > >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] RColorBrewer_1.0-5 > [2] caTools_1.12 > [3] bitops_1.0-4.1 > [4] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1 > [5] GenomicFeatures_1.8.1 > [6] org.Hs.eg.db_2.7.1 > [7] RSQLite_0.11.1 > [8] DBI_0.2-5 > [9] AnnotationDbi_1.18.0 > [10] Biobase_2.16.0 > [11] VariantAnnotation_1.2.5 > [12] Rsamtools_1.8.3 > [13] BSgenome.Hsapiens.UCSC.hg19_1.3.17 > [14] BSgenome_1.24.0 > [15] Biostrings_2.24.1 > [16] GenomicRanges_1.8.3 > [17] IRanges_1.14.2 > [18] BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.12.0 grid_2.15.0 lattice_0.20-6 Matrix_1.0-6 > [5] RCurl_1.91-1 rtracklayer_1.16.1 snpStats_1.6.0 splines_2.15.0 > [9] stats4_2.15.0 survival_2.36-12 tools_2.15.0 XML_3.9-4 > [13] zlibbioc_1.2.0 > -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793