cannot allocate memory?
Haiying.Kong ▴ 110
@haiyingkong-9254
Germany

I am working on a Linux system with 250 GB of memory. Another program is currently running, and it uses less than 20 GB.

If I try to load a VCF file (<16 GB) with the readVcf function in the VariantAnnotation package, I get this error message:

> germ.mut = readVcf("/home/kong/Haiying/Projects/PrimaryMelanoma/AllBatches/Lock/GermlineMutation/GermlineMutations.vcf", "hg19")
Error: scanVcf: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?
  path: /home/kong/Haiying/Projects/PrimaryMelanoma/AllBatches/Lock/GermlineMutation/GermlineMutations.vcf

I tried gc() before running the line and got the same error message.

I could follow the solution in the thread "Error in reading 1000 genomes data".

But since I have so much memory, is there any way to load the whole VCF at once?

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: openSUSE 13.1 (Bottle) (x86_64)

locale:
 [1] LC_CTYPE=en_GB.UTF-8          LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8           LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8       LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8          LC_NAME=en_GB.UTF-8
 [9] LC_ADDRESS=en_GB.UTF-8        LC_TELEPHONE=en_GB.UTF-8
[11] LC_MEASUREMENT=en_GB.UTF-8    LC_IDENTIFICATION=en_GB.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] cgdv17_0.12.0              VariantAnnotation_1.20.3
 [3] Rsamtools_1.26.2           Biostrings_2.42.1
 [5] XVector_0.14.1             SummarizedExperiment_1.4.0
 [7] Biobase_2.34.0             GenomicRanges_1.26.4
 [9] GenomeInfoDb_1.10.3        IRanges_2.8.2
[11] S4Vectors_0.12.2           BiocGenerics_0.20.0
[13] BiocInstaller_1.24.0       xlsx_0.5.7
[15] xlsxjars_0.6.1             rJava_0.9-8

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10             AnnotationDbi_1.36.2     GenomicAlignments_1.10.1
 [4] zlibbioc_1.20.0          BiocParallel_1.8.2       BSgenome_1.42.0
 [7] lattice_0.20-35          tools_3.3.3              grid_3.3.3
[10] DBI_0.6-1                digest_0.6.12            Matrix_1.2-8
[13] rtracklayer_1.34.2       bitops_1.0-6             biomaRt_2.30.0
[16] RCurl_1.95-4.8           memoise_1.1.0            RSQLite_1.1-2
[19] GenomicFeatures_1.26.4   XML_3.98-1.6

 

@martin-morgan-1513
United States

Input only the information you're interested in, using the specialized functions readInfo() and readGeno() or, more generally, readVcf() with a ScanVcfParam() object. If the data are still too large, iterate through the file using VcfFile() with a yieldSize argument, and GenomicFiles::reduceByYield(). The relevant help pages and package vignettes (e.g., on the landing page https://bioconductor.org/packages/VariantAnnotation) have examples that might help you to pose additional, more specific questions if you run into problems.
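As a rough sketch of both approaches (selective input, then chunked iteration), the following uses the small chr22.vcf.gz example file that ships with VariantAnnotation. The choice of the GT field and the yieldSize of 4000 are assumptions for illustration; for your own data you would substitute a bgzip-compressed, tabix-indexed copy of your file and the fields you actually need.

```r
library(VariantAnnotation)

# Example file shipped with the package; replace with your own
# bgzip-compressed, tabix-indexed VCF.
fl <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")

# 1. Selective input: skip all INFO fields and read only the GT
#    genotype field (an illustrative choice, not a recommendation).
param <- ScanVcfParam(info = NA, geno = "GT")
vcf_gt <- readVcf(fl, "hg19", param = param)

# 2. Chunked input: read at most yieldSize records per call and
#    process each chunk before reading the next one.
vf <- VcfFile(fl, yieldSize = 4000)
open(vf)
n_total <- 0
while (nrow(chunk <- readVcf(vf, "hg19", param = param)) > 0) {
    # Placeholder for real per-chunk work; here we only count records.
    n_total <- n_total + nrow(chunk)
}
close(vf)

n_total  # same number of records as nrow(vcf_gt)
```

Only one chunk is held in memory at a time, so peak memory depends on yieldSize rather than on the size of the file.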


Thank you very much for your reply.

How is it decided that "the data are still too large"? I should have more than 200 GB of memory available, and the VCF file I am trying to load is less than 16 GB.


By 'still too large' I meant that you receive the same error about memory allocation.

When I look at the code, it seems like the error message could be moderately misleading. The error occurs if, for one of the components of the VCF file (e.g., the GT field), the product of the dimensions of the resulting matrix is larger than the maximum integer size (about 2.14 billion). With 1000 samples and a field taking on three values per sample, the maximum number of variants would be about 715,000 (2.14 billion / 1000 / 3). Making the code work with much larger data sets isn't a priority for me; processing the file in chunks lets you manage memory, allows other processes and users to access the computer's resources, and facilitates parallel evaluation.

There are a number of additional possibilities. Remember that a VCF file is a text file, but one operates in R on different data types, so the character value '1' in a VCF file is a single byte but is represented by a double in R, and so takes up 8 bytes. Depending on memory allocation patterns, memory available to the operating system may become fragmented, so that while there are many more than x bytes available, contiguous blocks are all less than the amount required. Etc.
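The arithmetic behind that estimate can be reproduced in base R. This is only the back-of-the-envelope calculation from the paragraph above, not the package's internal check; the sample count and values per sample are the illustrative numbers used there.

```r
# Largest value an R integer can hold (2^31 - 1).
int_max <- .Machine$integer.max

n_samples <- 1000        # illustrative, from the example above
values_per_sample <- 3   # illustrative, from the example above

# Largest number of variants for which
# variants * samples * values still fits in an integer.
max_variants <- int_max %/% (n_samples * values_per_sample)
max_variants  # 715827
```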
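To see the per-value cost in R, a small base R check along these lines may help. The vector length is arbitrary, and object.size() reports only an estimate of the memory used by the vector plus a small header.

```r
n <- 1e6  # arbitrary number of values for illustration

# The same number of values stored as doubles and as integers.
bytes_double  <- as.numeric(object.size(numeric(n)))
bytes_integer <- as.numeric(object.size(integer(n)))

bytes_double / n   # roughly 8 bytes per value
bytes_integer / n  # roughly 4 bytes per value
```

So one million single-character values from the text file would need about 8 MB once held as doubles, before any copies made during parsing.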


Thank you very much for the explanation.
