I tried to read in a large compressed multi-vcf file (20GB) into r and then apply somaticsignatures on the dataset in r. I have already tried to read in chunk and selected the columns that I need to use.I requested 80G on campus cluster but still failed to read in the vcf file at the repeat command stage.
Can anyone look into the code and any suggestion is appreciated!
library(VariantAnnotation)
library(BSgenome.Hsapiens.UCSC.hg19)
library(SomaticSignatures)
library(ggplot2)
setwd('~/VCF')
fl="all.vcf.gz"
vcffile = open(VcfFile(fl, yieldSize=5000))
repeat {
vcf = readVcf(vcffile, "BSgenome.Hsapiens.UCSC.hg19",param=ScanVcfParam(info=c("RAF","AF","INFO"), geno=c("GT","DS")))
if (nrow(vcf) == 0)
break
}
seqlevels(vcf)<-c('chr1','chr2','chr3','chr4','chr5','chr6','chr7','chr8','chr9','chr10','chr11','chr12','chr13','chr14','chr15','chr16','chr17','chr18','chr19','chr20','chr21','chr22','chrX')
vr<- as(vcf, "VRanges")
mc<-mutationContext(vr,BSgenome.Hsapiens.UCSC.hg19)
mm<-motifMatrix(mc, normalize=TRUE,group = "KRAS")
mm[mm == 0]<-0.000001
sigs_nmf = identifySignatures(mm, 6, nmfDecomposition)
pdf(file="plotSignatures.pdf", width=10, height=8)
plotObservedSpectrum(sigs_nmf)
dev.off()
sessionInfo( )
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] SomaticSignatures_2.26.0 BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.58.0
[4] rtracklayer_1.50.0 survminer_0.4.9 ggpubr_0.4.0
[7] ggplot2_3.3.5 survivalROC_1.0.3 survival_3.2-11
[10] readxl_1.3.1 MutationalPatterns_3.0.1 NMF_0.23.0
[13] cluster_2.1.2 rngtools_1.5 pkgmaker_0.32.2
[16] registry_0.5-1 VariantAnnotation_1.36.0 Rsamtools_2.6.0
[19] Biostrings_2.58.0 XVector_0.30.0 SummarizedExperiment_1.20.0
[22] Biobase_2.50.0 GenomicRanges_1.42.0 GenomeInfoDb_1.26.7
[25] IRanges_2.24.1 S4Vectors_0.28.1 MatrixGenerics_1.2.1
[28] matrixStats_0.59.0 BiocGenerics_0.36.1 maftools_2.6.05
loaded via a namespace (and not attached):
[1] rappdirs_0.3.3 GGally_2.1.2 tidyr_1.1.3 bit64_4.0.5
[5] knitr_1.33 DelayedArray_0.16.3 data.table_1.14.0 rpart_4.1-15
[9] RCurl_1.98-1.3 AnnotationFilter_1.14.0 doParallel_1.0.16 generics_0.1.0
[13] GenomicFeatures_1.42.3 cowplot_1.1.1 RSQLite_2.2.7 shadowtext_0.0.8
[17] proxy_0.4-26 bit_4.0.4 enrichplot_1.10.2 xml2_1.3.2
[21] assertthat_0.2.1 viridis_0.6.1 xfun_0.24 hms_1.1.0
[25] fansi_0.5.0 progress_1.2.2 dbplyr_2.1.1 km.ci_0.5-2
[29] igraph_1.2.6 DBI_1.1.1 geneplotter_1.68.0 htmlwidgets_1.5.3
[33] reshape_0.8.8 purrr_0.3.4 ellipsis_0.3.2 dplyr_1.0.7
[37] backports_1.2.1 markdown_1.1 annotate_1.68.0 gridBase_0.4-7
[41] biomaRt_2.46.3 vctrs_0.3.8 ggalluvial_0.12.3 ensembldb_2.14.1
[45] abind_1.4-5 cachem_1.0.5 withr_2.4.2 ggforce_0.3.3
[49] checkmate_2.0.0 GenomicAlignments_1.26.0 prettyunits_1.1.1 DOSE_3.16.0
[53] lazyeval_0.2.2 crayon_1.4.1 genefilter_1.72.1 pkgconfig_2.0.3
[57] labeling_0.4.2 tweenr_1.0.2 ProtGenerics_1.22.0 nnet_7.3-16
[61] rlang_0.4.11 lifecycle_1.0.0 table1_1.4.2 downloader_0.4
[65] BiocFileCache_1.14.0 dichromat_2.0-0 cellranger_1.1.0 polyclip_1.10-0
[69] graph_1.68.0 Matrix_1.3-4 KMsurv_0.1-5 carData_3.0-4
[73] zoo_1.8-9 base64enc_0.1-3 pheatmap_1.0.12 png_0.1-7
[77] viridisLite_0.4.0 bitops_1.0-7 pROC_1.17.0.1 blob_1.2.1
[81] stringr_1.4.0 qvalue_2.22.0 jpeg_0.1-8.1 rstatix_0.7.0
[85] ggsignif_0.6.2 scales_1.1.1 memoise_2.0.0 magrittr_2.0.1
[89] plyr_1.8.6 zlibbioc_1.36.0 compiler_4.0.3 scatterpie_0.1.6
[93] tinytex_0.32 RColorBrewer_1.1-2 pcaMethods_1.82.0 DESeq2_1.30.1
[97] cli_2.5.0 htmlTable_2.2.1 Formula_1.2-4 MASS_7.3-54
[101] tidyselect_1.1.1 stringi_1.6.2 forcats_0.5.1 GOSemSim_2.16.1
[105] askpass_1.1 locfit_1.5-9.4 latticeExtra_0.6-29 ggrepel_0.9.1
[109] survMisc_0.5.5 grid_4.0.3 fastmatch_1.1-0 tools_4.0.3
[113] rio_0.5.27 rstudioapi_0.13 foreach_1.5.1 foreign_0.8-81
[117] gridExtra_2.3 farver_2.1.0 ggraph_2.0.5 digest_0.6.27
[121] rvcheck_0.1.8 BiocManager_1.30.16 pracma_2.3.3 ggtext_0.1.1
[125] Rcpp_1.0.6 gridtext_0.1.4 car_3.0-10 broom_0.7.8
[129] OrganismDbi_1.32.0 httr_1.4.2 AnnotationDbi_1.52.0 ggbio_1.38.0
[133] biovizBase_1.38.0 colorspace_2.0-2 XML_3.99-0.6 splines_4.0.3
[137] RBGL_1.66.0 graphlayouts_0.7.1 xtable_1.8-4 tidygraph_1.2.0
[141] R6_2.5.0 Hmisc_4.5-0 pillar_1.6.1 htmltools_0.5.1.1
[145] glue_1.4.2 fastmap_1.1.0 clusterProfiler_3.18.1 BiocParallel_1.24.1
[149] codetools_0.2-18 fgsea_1.16.0 utf8_1.2.1 lattice_0.20-44
[153] tibble_3.1.2 curl_4.3.2 zip_2.2.0 GO.db_3.12.1
[157] openxlsx_4.2.4 openssl_1.4.4 munsell_0.5.0 DO.db_2.9
[161] GenomeInfoDbData_1.2.4 iterators_1.0.13 haven_2.4.1 reshape2_1.4.4
[165] gtable_0.3.0
Hi James, Thank you very much for your prompt response and correcting my mistake. I tried the way you mentioned here but got job killed due to the lack of memory. So I am wondering if there is any other way to overcome this... Thank you!
Also, Can you explain a bit in detail of pass in a GRanges object and maybe give a toy example of reading in chunk? I didn't quite get this part. Thank you.
Here's some pseudo-code. Like I said, I know nothing about the
SomaticSignatures
package, other than it wants aVRanges
object. Your problem is that you can't read in all the VCF at one shot, so what you want to do is iterate through the VCF and get out what you need, which presumably will be more compact. But how you do that is sort of tricky. As an example, consider this:Iterating through your VCF and then concatenating the new data to the existing is absolutely brutal because you have to create a new object each time, which is why my first example is so slow. But if you pre-allocate the object that you are going to use and then just put the data into it, things go way faster. The trick is that you don't know a priori how big the object needs to be. Well, you could use awk and wc to get the number of rows, which is probably the way to go. So I'd do that, and then you can do something like
In order to make this work you will have to write a function that takes the
CollapsedVCF
object you read in at each iteration and extracts all the information you need into a matrix that you can then dump into your pre-formed matrix. Then when you are done, you can convert the matrix to aVRanges
object.There might be something in
SomaticSignatures
that will get the data you need, so maybe you don't need to write a function. But that's the general idea.