Hi all,
I have a feeling this is an easy one, but browsing the forum and the BiocParallel vignette, I can't find it.
Is there a BiocParallel equivalent of parRapply?
I have a nice MulticoreParam object to handle parallel computing, and if I could I would like to avoid making a separate cluster using the parallel package.
For context, I would like to apply a function to each row of the geno(vcf)[["GT"]] matrix to calculate the allele frequency of each variant (AF = 1*[0/1] + 2*[1/1]). I have ~80k thousand variants that I would like to reduce by filtering those > 1% MAF.
...except if I am missing an existing function that does part or all that I want? The reason why I am calculating it myself is that there are some funky genotypes in there (1/0, ./1, 0/., and more) which makes me wonder how the AF field of the VCF is calculated. At this stage, I simply prefer calculating the AF checking genotypes myself.
Cheers!
==== EDIT to add example ===
Thanks for both your comments.
Here is a made-up example:
# Define accepted genotypes refG = c("0/0") altHet = c("0/1") altHom = c("1/1") # A sample of genotypes observed in the VCF file allG = c( "0/0", "0", "1/1", "1/0", "1", "2", ".", "./.", "./0", "./1", "0/.", "1/.") # Make up a matrix like the one obtained using VariantAnnotation genos <- matrix( data = sample( x = allG, size = 1E6, replace = TRUE, prob = c(rep(0.25, 3), rep(0.25/9, 9))), nrow = 1E3, ncol = 1E3) dim(genos) genos[1:4,1:4] # My current solution to calculating MAF considering "accepted" genotypes minorAlleleFreq <- function(genos, refGenos, altHetGenos, altHomGenos){ tableCounts <- table(factor( x = genos, levels = c(refGenos, altHetGenos, altHomGenos) )) alleleCount <- tableCounts[altHetGenos] + 2 * tableCounts[altHomGenos] alleleNumber <- 2 * sum(tableCounts) return(alleleCount / alleleNumber) } mafs <- apply( X = genos, MARGIN = 1, FUN = minorAlleleFreq, refGenos = refG, altHetGenos = altHet, altHomGenos = altHom) # For your eyes only :) hist(mafs)
I am just wondering if there was a more efficient way to calculate mafs above.
Cheers!
There isn't a parRapply equivalent. But can you provide more detail, maybe with an example using one of the vcf files that comes with VariantAnnotation or a simple matrix constructed 'by hand' of what you want to calculate? I'd guess that there is a vectorized version of the calculation that will be very fast, so no need for parallel evaluation.
Martin is right, there has to be a faster way than looping, and it doesn't make sense to use
rapply
over a genotype matrix anyway, since there is only a single level of nesting. To avoid the nesting entirely, callexpand(vcf)
and then use conventional matrix operations.