Select specific variants from a VCF file
Adam • 0
@adam-10025
Last seen 12 months ago
Poland

Hello,

 

Does anyone know how to extract specific variants from VCF files?

I have several VCF files with variants from an NGS experiment. I'd like to keep only variants such as missense (stop gain, stop loss, start gain, start loss), splice site (in intron and exon), and all frameshift mutations.

What is more, I'm looking for changes with a small MAF; I know there is a 'COMMON=0' parameter.

So how can I do this filtering on Windows, or with some package in R?

All the best,

Adam.

vcf • 4.6k views
@martin-morgan-1513
Last seen 3 months ago
United States

Use ScanVcfParam() with readVcf() to selectively import your data into R, or filterVcf() to create a new VCF file with an appropriate subset. The primary sources of documentation are the vignettes and man pages of the relevant functions, available from within R in the usual way or from the package landing page.
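A minimal sketch of both routes, assuming the VariantAnnotation, Rsamtools and S4Vectors packages are installed. The helper names (write_subset, read_subset), the INFO field name ANN (SnpEff-style; VEP writes CSQ instead), the consequence terms and the "hg19" genome label are all assumptions for illustration, not something taken from this thread; COMMON is the flag mentioned in the question and is assumed to be declared as a single integer in the VCF header.

```r
# Consequence terms to keep; assumed Sequence Ontology-style labels,
# adjust them to whatever your annotator actually writes.
wanted <- "missense|frameshift|stop_gained|stop_lost|start_lost|splice"

# Prefilters see raw VCF data lines: character vector in, logical vector out.
hasWantedConsequence <- function(lines) grepl(wanted, lines)
isNotCommon <- function(lines) grepl("COMMON=0", lines, fixed = TRUE)

# Route 1: write a new, smaller VCF file with filterVcf().
# filterVcf() reads through a tabix index, so compress and index first.
write_subset <- function(vcf_path, out_path, genome = "hg19") {
  bgz <- Rsamtools::bgzip(vcf_path, overwrite = TRUE)
  Rsamtools::indexTabix(bgz, format = "vcf")
  prefilters <- S4Vectors::FilterRules(list(
    consequence = hasWantedConsequence,
    rare = isNotCommon
  ))
  VariantAnnotation::filterVcf(bgz, genome, out_path, prefilters = prefilters)
}

# Route 2: import only the INFO fields you need, then subset in R.
read_subset <- function(vcf_path, genome = "hg19") {
  param <- VariantAnnotation::ScanVcfParam(info = c("COMMON", "ANN"), geno = NA)
  vcf <- VariantAnnotation::readVcf(vcf_path, genome, param = param)
  common <- VariantAnnotation::info(vcf)$COMMON
  vcf[!is.na(common) & common == 0]
}
```

With hypothetical file names, write_subset("sample.vcf", "sample_filtered.vcf") would produce the reduced file, and read_subset("sample.vcf") would return a VCF object restricted to COMMON=0 records.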

VCF files are of course just text files, but they are highly structured; grep is ok for some basic manipulations (filterVcf does this for the 'prefilters') but other computations involve unpacking the data more completely. 
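To illustrate the "just text" view, here is a minimal base-R sketch that needs no extra packages and so also runs on Windows. The function name filter_vcf_lines, the example records and the consequence terms are made up for illustration; it assumes the annotation text and the COMMON flag both sit somewhere on each data line.

```r
# Keep all header lines, plus data lines that match both patterns.
filter_vcf_lines <- function(lines, consequence_pattern, rare_tag = "COMMON=0") {
  is_header <- startsWith(lines, "#")
  keep <- grepl(consequence_pattern, lines) &
    grepl(rare_tag, lines, fixed = TRUE)
  lines[is_header | keep]
}

# Assumed consequence labels; adjust to your annotator's vocabulary.
wanted <- "missense|frameshift|stop_gained|stop_lost|start_lost|splice"

# Tiny made-up VCF, one string per line.
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
  "1\t100\trs1\tA\tG\t50\tPASS\tCOMMON=0;ANN=G|missense_variant",
  "1\t200\trs2\tC\tT\t50\tPASS\tCOMMON=1;ANN=T|missense_variant",
  "1\t300\trs3\tG\tA\t50\tPASS\tCOMMON=0;ANN=A|synonymous_variant"
)
subset_lines <- filter_vcf_lines(vcf_lines, wanted)

# For a real file (hypothetical names):
# writeLines(filter_vcf_lines(readLines("in.vcf"), wanted), "out.vcf")
```

This matches against the whole line, so it cannot tell which field, or which allele at a multi-allelic site, a term belongs to. That is the kind of question where unpacking the data properly pays off.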

This is maybe a little philosophical, but there is tremendous value in semantically 'rich' data that one loses with dplyr; a short compare and contrast is for instance at slides 14 - 16 of these slides. This value is compounded the more you use Bioconductor: for a one-off it seems like overkill, but for daily use you find yourself spending less time worrying about data representation and more time addressing the informatic, statistical, and biological questions that motivate your research.

@james-w-macdonald-5106
Last seen 11 minutes ago
United States

In basic terms, you want to read the VCF file(s) into R using the VariantAnnotation package. You can then use a TxDb package to get a transcripts GRanges object, and use subsetByOverlaps to subset your VCF to the variants that overlap a known transcript. You can then use predictCoding and a BSgenome package to predict the coding consequences. This is all covered in the VariantAnnotation vignette, so I would direct you there for more details.
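The steps above can be sketched roughly as follows. This assumes human hg19 data and that the TxDb.Hsapiens.UCSC.hg19.knownGene and BSgenome.Hsapiens.UCSC.hg19 annotation packages are installed; substitute the packages that match your genome build. The helper name predict_protein_changes is hypothetical, and the locateVariants() call for splice sites is an addition not mentioned in the answer.

```r
# predictCoding() labels each coding variant with one of:
# "synonymous", "nonsynonymous", "frameshift", "nonsense", "not translated".
protein_changing <- c("nonsynonymous", "frameshift", "nonsense")

# Hypothetical helper; calls are namespaced so that the definition
# loads even before the packages are attached.
predict_protein_changes <- function(vcf_path, genome = "hg19") {
  txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene::TxDb.Hsapiens.UCSC.hg19.knownGene
  seqs <- BSgenome.Hsapiens.UCSC.hg19::Hsapiens

  vcf <- VariantAnnotation::readVcf(vcf_path, genome)
  # Match chromosome naming ("1" vs "chr1") to the annotation packages.
  GenomeInfoDb::seqlevelsStyle(vcf) <- "UCSC"

  # Keep only variants that overlap a known transcript.
  tx <- GenomicFeatures::transcripts(txdb)
  vcf_tx <- IRanges::subsetByOverlaps(vcf, tx)

  # Predict amino acid consequences, then keep the protein-changing ones.
  coding <- VariantAnnotation::predictCoding(vcf_tx, txdb, seqSource = seqs)
  coding <- coding[as.character(coding$CONSEQUENCE) %in% protein_changing]

  # Splice-site variants are located separately.
  splice <- VariantAnnotation::locateVariants(
    vcf_tx, txdb, VariantAnnotation::SpliceSiteVariants()
  )
  list(coding = coding, splice = splice)
}
```

VCF files with contigs that are missing from the annotation packages may need their seqlevels pruned first; the vignette covers that case.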


Yes, I actually read about this package, but don't you think it's a bit complicated? I'm asking because the VCF file already has the variation type (missense, splice region, frameshift, etc.). So maybe a typical dplyr filter with grep in R would be enough?
