readVcfAsVRanges + FilterRules
3
1
Entering edit mode
Sigve Nakken ▴ 50
@sigve-nakken-6575
Last seen 2.1 years ago
Norway

Hi,

I have successfully used the VariantAnnotation (1.12.9 now) for processing my multi-sample VCF files for some time now. Especially, I have utilized the combination of filterVcf (with prefilters) and readVcfAsVRanges to retrieve sample variant calls from my sequencing projects. I have not been able to filter on sample genotype data (e.g. read depth etc.) with these routines, so I implemented this routine as a third step.

Lately, as the number of calls and number of samples have increased, one step is getting substantially slow and eating up memory; readVcfAsVRanges. I guess the reason for this is the large set of variants that pass the preFilter (> 100,000 PASSed variants), and that the filters in filterVcf (which according to the tutorial could be applied on the genotype data) would not make any sense for multi-sample VCF files (correct me if am wrong).

Is there any way I can utilize readVcfAsVRanges in a more efficient manner, enabling me to skip (not load into memory) samples with a 'NULL' genotype?

best,

Sigve

 

variantannotation filterVcf readvcfasvranges filterRules • 2.3k views
ADD COMMENT
2
Entering edit mode
@martin-morgan-1513
Last seen 4 months ago
United States

I'm not sure if I fully understand the problem, but perhaps GenomicFiles::reduceByYield is helpful? In psuedo-code:

X = TabixFile(..., yieldSize=...)
YIELD = function(x)
    ## 'x' is the TabixFile
    readVcfAsVRanges(x, "hg19", param=...)
MAP = function(x) {
    ## 'x' is the (moderately sized) VRanges object from YIELD
    keep = ... # keep interesting genotypes
    v[keep]
}
REDUCE = c           ## concatenate the output of successive calls to MAP
DONE = function(x)
    ## 'x' is the output of YIELD; DONE signals when the entire file is processed
    length(x) == 0

and then put it together as

vr = reduceByYield(X, YIELD, MAP, REDUCE, DONE)

(REDUCE and DONE are actually default behaviors, so can be omitted). The end result is a modest-sized VRanges object.

ADD COMMENT
1
Entering edit mode
@valerie-obenchain-4275
Last seen 2.9 years ago
United States

Hi Sigve,

One approach for reducing memory use is to set 'yieldSize' in a TabixFile object (VcfFile object in VariantAnnotation devel). Using a TabixFile as the 'x' arg in readVcfAsVRanges() will iterate through the file in chunks specified by 'yieldSize'.

> fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
> TabixFile(fl, index=character(), yieldSize=25000)
class: TabixFile 
path: /home/vobencha/R/R-rel/R-3-1-branch/library/VariantAnnotation/e.../ex2.vcf
isOpen: FALSE 
yieldSize: 25000 

 

As for the filtering, filterVcf() operates by removing rows of the VCF that don't meet the criteria. The filters produce a logical vector, the same length  as nrow(VCF), which is used as a pass/fail indicator. This row-based filtering is well suited for FIXED or INFO fields or single-sample files. You could create a filter that tested for a NULL genotype (or read depth cutoff) but ultimately need TRUE/FALSE for each row; if any sample fails the whole row is removed. If I understand correctly you want to remove samples with a NULL genotype for any of the variants. Unfortunately sample-specific filtering can't be done with the current filterVcf(). To filter by sample use ScanVcfParam(sample = c("samp1", "samp25", "samp50")) as the 'param' argument to readVcf() or readVcfAsVRanges().

As an fyi, in VariantAnnotation in devel the default 'param' has changed from VRangesScanVcfParam() which imports in minimal fields to ScanVcfParam() which imports in all fields. Since you're using 1.12.9 you should be getting minimal fields, just a heads up for future use.

If the 'yieldSize' approach doesn't work for you please show how you're calling the function and we'll go from there.

Valerie

ADD COMMENT
0
Entering edit mode
Sigve Nakken ▴ 50
@sigve-nakken-6575
Last seen 2.1 years ago
Norway

Hi Valerie,

Thanks for the rapid response! Processing one sample at a time, using samples in readVcfAsVRanges, appears to be the approach that works for me. I tried the tabix-approach too, but that one gave strange results (not sure why).

I understand the row-based filtering logic used in filterVcf(), my point was also to note that while the second pass of filtering (VCF instance-based) in filterVcf() appears very useful (e.g. isSnp(), lowCoverage() etc.), it cannot really be utilized if the VCF file contains more than one sample (which is often the case for many users, I would guess). If you have a VCF with 10,000 variants and 100 sample columns, and where (on average) each variant has proper genotype data for two-three samples, readVcfAsVRanges loads a huge number of sample calls that ideally should have been skipped. From your response, I think you understand this case.

Sigve 

ADD COMMENT
0
Entering edit mode

I'm glad the sample approach worked. If you want to send the results of the tabix run (off line or here) I can take a look; that approach should work too.

I agree, most genotype filters are only meaningful for single-sample files (I'll add this to the docs). The case where genotype filtering does work for multiple samples is when the criteria is applied across samples and not to the individual, e.g.,  'keep rows where all samples have DP > 10' .

Valerie

ADD REPLY

Login before adding your answer.

Traffic: 563 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6