Dear Bioconductor Package Maintainers and Developers,
This started out as question, but I feel like I ended up answering myself, hence posting as a tutorial. Feedback welcome!
Considering a VCF file with many ensemblVEP predictions, including a mixture of factors (e.g. "IMPACT"), and numeric fields (e.g. allele frequencies from ExAC project populations). The parseCSQToGRanges() function seems to turn every field into a factor in the resulting DataFrame object.
Obviously, this is trivial to fix by hand on a small scale (e.g. when plotting a couple of predictions in ggplot2). However, I am writing some generic code that would greatly benefit of knowing what is truly a factor, from what could be plotted on a continuous scale.
While I appreciate the difficulty of the task, considering the infinite number of fields and plugins that could appear in the CSQ/ANN field, I merely wonder:
- Whether there are any foreseeable plan to address this point (i.e. automatically detect factor/numeric fields)
- Independently of 1., which would be the recommended way to deal with the problem. I present my current idea (I didn't want to post without contributing anything!)
- define as numeric the predictions which can be converted to numeric without warning (e.g. as.numeric(as.character(csq$DISTANCE)) ), the remaining being considered factors (see below)
This solution (applicable to the example data set at the end) ended up looking like:
mcols_names <- colnames(mcols(csq)) convertibles <- sapply(mcols_names, function(x_name){ sum( suppressWarnings(!is.na(as.numeric(as.character(mcols(csq)[,x_name] )))), na.rm = TRUE ) > 0 }) which(convertibles)
In words: "which of the predictions are left with at least one value not NA after converting the factor to character and then to numeric".
The only tricky one here is the STRAND, encoded 0/1 although it is a factor.
In any case, I find the code above a little clumsy, and I genuinely wonder whether anyone has something cleaner to suggest :)
For the sake of reporting a replicable situation, I copy-paste the sample code from the parseCSQToGRanges() help page, where the resulting csq object contains field DISTANCE, clearly numeric, although displayed as a factor.
library(ensemblVEP) file <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") vep <- ensemblVEP(file, param=VEPParam(dataformat=c(vcf=TRUE))) ## The returned 'CSQ' data are unparsed. info(vep)$CSQ ## Parse into a GRanges and include the 'VCFRowID' column. vcf <- readVcf(file, "hg19") csq <- parseCSQToGRanges(vep, VCFRowID=rownames(vcf)) csq[1:4]
Best wishes,
Kevin
Hi Kevin,
There are no plans to enforce data type on the CSQ fields. AFAIK data type is not is not provided from ensembl and guessing at the type can problematic. For example, a field might contain identifiers that mix numbers and letters. These should be considered characters but your code would mark it as 'convertible' to numeric:
> sum(suppressWarnings(!is.na(as.numeric(c("1", "1B")))), na.rm = TRUE) > 0
[1] TRUE
CSQ data are received as a single string, field separated by '|' and all are parsed as characters. See the description in the 'Data format' section, option --vcf:
http://uswest.ensembl.org/info/docs/tools/vep/script/vep_options.html
You can see the unparsed CSQ (and probably have) with
file <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
vep <- ensemblVEP(file, param=VEPParam(dataformat=c(vcf=TRUE)))
info(vep)
Previously all character fields were coerced to factors by default when creating the DataFrame. I've checked in a fix to version 1.13.1 (devel) that prevents this so the fields are now plain characters. This doesn't address the issue of the non-characters but I don't think we should enforce data type unless it's provided.
Valerie
Thank you Valerie.
I appreciate the little fix in the devel branch.
I have to admit that this is not a critical issue, so for the time being I'll just leave it to the user of my code to select the appropriate fields when requesting a factor, a character or a numeric type. I was hoping to provide subsetted selections of fields using "is.<class>" methods, but common sense should be enough: a line must sometimes be drawn between "blame-the-programer" and "blame-the-user" scenarios ;)
Thanks for the answer!
Kevin