Question

Best way to create a VRanges object from a large VCF file?

0

Entering edit mode

ruben.drews • 0

@rubendrews-12149

Last seen 2.2 years ago

United Kingdom

Hi all,

I got from my collaborators a VCF file which contains all SNVs from all samples. The rows depict the SNVs and the columns depict the mutational status in all samples. Naturally, it is a sparse matrix. Here is an example line:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  do9157 do9158 do9159 do9160 do9161 do9162 do9163
chr1    3016583 chr1:3016583_A/G        A       G       .       PASS    QSS=68;TQSS=1;NT=ref;QSS_NT=68;TQSS_NT=1;SGT=AA->AG;SOMATIC;VEP=intergenic_variant;CONTEXT=AGTxCCA;TXNSTR=N     GT:DP:FDP:SDP:SUBDP:AU:CU:GU:TU .:.:.:.:.:.,0:.,0:.,0:.,0       .:.:.:.:.:.,0:.,0:.,0:.,0       .:.:.:.:.:.,0:.,0:.,0:.,0       .:.:.:.:.:.,0:.,0:.,0:.,0       .:.:.:.:.:.,0:.,0:.,0:.,0       0/1:65:1:0:0:41,42:0,0:23,24:0,0        .:.:.:.:.:.,0:.,0:.,0:.,0

My goal is to have a VRanges object for mutational signature analysis.

The whole file is about 2GB and contains 5,000,000 SNVs in 80 samples. Standard calls of readVcf or readVcfAsVRanges die as the amount of needed RAM exceeds my 32GB.

Method #1:

One problem is, that readVcfAsVRanges multiplies each line for each of the 80 samples. This is not always necessary as only every other line contains more than one entries. But each line contains at least one entry. Is there a way to evaluate the GT field for existence of "." or "0/1" while reading the file?

Method #2:

I tried one method, where I split the file parsing steps into sample-specific processes. This works well and I produce a list containing 80 VRanges objects. Here, I fail to concatenate the list to one VRanges object in the end.

Method #3:

File reading via TabixFile but the manual is quite unclear of how to proceed here.

tl;dr:

How can I create a VRanges object from a large file which contains all mutational statuses of all samples, so mostly dots and zeros I don't need?

variantannotation readvcf readvcfasvranges vcf • 2.4k views

ADD COMMENT • link updated 8.2 years ago by Michael Lawrence ★ 11k • written 8.2 years ago by ruben.drews • 0

0

Entering edit mode

Did you use ScanVcfParam() to selectively import only the information that you need? Also, for smoe use cases there are readGeno() and similar.

ADD REPLY • link 8.2 years ago Martin Morgan 25k

0

Entering edit mode

Thanks for the quick answer!

Method#2: Yes, I basically filter for "." in the GT field:

vcfTemp=readVcfAsVRanges(vcfFile, genome="mm10", param)

vcfTempFiltered=vcfTemp[which(vcfTemp$GT!=".")]

But then I cannot append or merge the files from the list of VRanges objects together.

Currently, I'm having a look at the GenomicFiles package. Thank your for pointing this out to me. Can I ask you for a minimal example of how to loop over a file by yieldSize?

ADD REPLY • link 8.2 years ago ruben.drews • 0

score 1 · Answer 1 · 2017-01-13

You really don't want to concatenate iteratively in this case, as there is a lot of overhead to each contatenation (and if you really want to iterate with side effects, use a for() loop for clarity). You should use lapply() when reading, not sapply(), since you really just want a list. And you can cast the list to a VRangesList, which you can then stack to a VRanges with properly populated sample names:

vcfBig <- stackSamples(List(vcf))

But why not use GenomicFiles? I've never used it before, but it's probably something like this:

reader <- function(vcf) readVcfAsVRanges(vcf, genome="mm10", param)
filter <- function(vr) vr[vr$GT!="."]
reducer <- function(vrl) stackSamples(List(vrl))
vcfBig <- reduceByYield(VcfFile(vcfFile, yieldSize=1e5), reader, filter, reducer, iterate=FALSE)

Btw, since I doubt you're dealing with haploid mice, the missing value for the genotypes should be "./.", not ".". So your VCF is incorrect.

score 0 · Answer 2 · 2017-01-13

Decided to do it the quick-and-dirty way. For the sake of this community, I'll post my solution but it's not pretty. For my 5,000,000x80 it takes around 20-25 minutes.

inputLargeVcfFile=function(vcfFile){

    # Get samples from VCF file
    hvcf=scanVcfHeader(vcfFile)
    sampleNames=samples(hvcf)

    vcf=VRanges()
    vcf=sapply(sampleNames, function(sampleName) { 
        print(sampleName)
        param=ScanVcfParam(samples=sampleName, trimEmpty=TRUE, info=c("CONTEXT", "SGT"), geno=c("GT", "AU", "CU", "GU", "TU"))
        vcfTemp=readVcfAsVRanges(vcfFile, genome="mm10", param)
        vcfTempFiltered=vcfTemp[which(vcfTemp$GT!=".")]
        }
    )

    vcfBig=VRanges()
    vcfTemp=sapply(vcf, function(vcfSmall) vcfBig <<- append(vcfBig, vcfSmall))
    rm(vcfTemp)
    
    return(vcfBig)
}

Thanks Michael for your help!