Dear all,
I ran the following code and hit the error already described in https://support.bioconductor.org/p/104454/, https://support.bioconductor.org/p/104845/, and https://support.bioconductor.org/p/106479/.
Additional note: The main EnsEMBL BioMart seems to be down for maintenance currently, so I had to use one of the mirror sites as advised.
The relevant portion of my code is below; if needed, I can try to provide a minimal runnable example. I am stuck at the moment.
Thank you.
# Other libraries
library("BiocParallel")
library("DESeq2")
library("ggrepel")
library("tidyverse")
library("readr")
library("stringr")
library("AnnotationDbi")
library("EnsDb.Hsapiens.v86")
library("biomaRt")
ENSEMBL_DB_HOST = "useast.ensembl.org" # Set back to default, once they are up and running again
ENSEMBL_VERSION = "Ensembl Genes 96" # Try to fix https://support.bioconductor.org/p/104454/
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl",
                host = ENSEMBL_DB_HOST, version = ENSEMBL_VERSION)
go_sets <- getBM(
  attributes = c("ensembl_gene_id", "hgnc_symbol", "entrezgene", "go_id",
                 "name_1006", "definition_1006", "go_linkage_type", "namespace_1003"),
  filters = "ensembl_gene_id",
  values = gsub("\\..*", "", row.names(res)),
  mart = mart
)
(res is a DESeq2 results object with EnsEMBL gene IDs as row.names. I strip the version suffix using the gsub() call.)
The stacktrace
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 5308 did not have 8 elements
Traceback:
1. getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", "entrezgene",
. "go_id", "name_1006", "definition_1006", "go_linkage_type",
. "namespace_1003"), filters = "ensembl_gene_id", values = gsub("\\..*",
. "", row.names(res)), mart = mart) # at line 4-8 of file <text>
2. read.table(con, sep = "\t", header = callHeader, quote = quote,
. comment.char = "", check.names = FALSE, stringsAsFactors = FALSE)
3. scan(file = file, what = what, sep = sep, quote = quote, dec = dec,
. nmax = nrows, skip = 0, na.strings = na.strings, quiet = TRUE,
. fill = fill, strip.white = strip.white, blank.lines.skip = blank.lines.skip,
. multi.line = FALSE, comment.char = comment.char, allowEscapes = allowEscapes,
. flush = flush, encoding = encoding, skipNul = skipNul)
Version info
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS
Matrix products: default
BLAS/LAPACK: /opt/conda/lib/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] biomaRt_2.38.0 GO.db_3.7.0
[3] org.Hs.eg.db_3.7.0 EnsDb.Hsapiens.v86_2.99.0
[5] ensembldb_2.6.8 AnnotationFilter_1.6.0
[7] GenomicFeatures_1.34.8 AnnotationDbi_1.44.0
[9] forcats_0.4.0 stringr_1.4.0
[11] dplyr_0.8.0.1 purrr_0.3.2
[13] readr_1.3.1 tidyr_0.8.3
[15] tibble_2.1.1 tidyverse_1.2.1
[17] ggrepel_0.8.0 ggplot2_3.1.1
[19] DESeq2_1.22.2 SummarizedExperiment_1.12.0
[21] DelayedArray_0.8.0 matrixStats_0.54.0
[23] Biobase_2.42.0 GenomicRanges_1.34.0
[25] GenomeInfoDb_1.18.2 IRanges_2.16.0
[27] S4Vectors_0.20.1 BiocGenerics_0.28.0
[29] BiocParallel_1.16.6
loaded via a namespace (and not attached):
[1] colorspace_1.4-1 IRdisplay_0.7.0 htmlTable_1.13.1
[4] XVector_0.22.0 base64enc_0.1-3 rstudioapi_0.10
[7] bit64_0.9-7 lubridate_1.7.4 xml2_1.2.0
[10] splines_3.5.1 geneplotter_1.60.0 knitr_1.22
[13] IRkernel_0.8.15 Formula_1.2-3 jsonlite_1.6
[16] Rsamtools_1.34.1 broom_0.5.2 annotate_1.60.1
[19] cluster_2.0.9 compiler_3.5.1 httr_1.4.0
[22] backports_1.1.4 assertthat_0.2.1 Matrix_1.2-17
[25] lazyeval_0.2.2 cli_1.1.0 acepack_1.4.1
[28] htmltools_0.3.6 prettyunits_1.0.2 tools_3.5.1
[31] gtable_0.3.0 glue_1.3.1 GenomeInfoDbData_1.2.0
[34] Rcpp_1.0.1 cellranger_1.1.0 Biostrings_2.50.2
[37] nlme_3.1-139 rtracklayer_1.42.2 xfun_0.6
[40] rvest_0.3.3 XML_3.98-1.19 zlibbioc_1.28.0
[43] scales_1.0.0 hms_0.4.2 ProtGenerics_1.14.0
[46] RColorBrewer_1.1-2 curl_3.3 memoise_1.1.0
[49] gridExtra_2.3 rpart_4.1-15 latticeExtra_0.6-28
[52] stringi_1.4.3 RSQLite_2.1.1 genefilter_1.64.0
[55] checkmate_1.9.3 repr_0.19.2 rlang_0.3.4
[58] pkgconfig_2.0.2 bitops_1.0-6 evaluate_0.13
[61] lattice_0.20-38 GenomicAlignments_1.18.1 htmlwidgets_1.2
[64] bit_1.1-14 tidyselect_0.2.5 plyr_1.8.4
[67] magrittr_1.5 R6_2.4.0 generics_0.0.2
[70] Hmisc_4.2-0 pbdZMQ_0.3-3 DBI_1.0.0
[73] pillar_1.3.1 haven_2.1.0 foreign_0.8-71
[76] withr_2.1.2 survival_2.44-1.1 RCurl_1.95-4.12
[79] nnet_7.3-12 modelr_0.1.4 crayon_1.3.4
[82] uuid_0.1-2 progress_1.2.0 locfit_1.5-9.1
[85] grid_3.5.1 readxl_1.3.1 data.table_1.12.2
[88] blob_1.1.1 digest_0.6.18 xtable_1.8-4
[91] munsell_0.5.0
Hi Mike,
thank you very much. I did that.
The error persists today, using
ENSEMBL_DB_HOST = "www.ensembl.org"
So you were right, and I scanned through the gene IDs to provide a minimal example that reproduces the failure. The culprit was
which yields (abbreviated due to size restrictions of the post)
This can be circumvented by not including the 'definition_1006' attribute.
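For readers who still need the GO definitions, the workaround above could be sketched roughly as follows. This is only a sketch under assumptions: get_go_without_definitions is a hypothetical helper (not part of biomaRt), and looking the definitions up in the locally installed GO.db package is a swapped-in technique whose GO release may differ from the one Ensembl uses.

```r
# Hypothetical helper (not part of biomaRt): run the same query without the
# free-text 'definition_1006' attribute, then attach definitions from GO.db.
# Assumption: GO.db definitions are an acceptable substitute for Ensembl's.
get_go_without_definitions <- function(gene_ids, mart) {
  go_sets <- biomaRt::getBM(
    attributes = c("ensembl_gene_id", "hgnc_symbol", "entrezgene", "go_id",
                   "name_1006", "go_linkage_type", "namespace_1003"),
    filters = "ensembl_gene_id",
    values = gene_ids,
    mart = mart
  )

  # Only look up GO IDs that GO.db actually knows; genes without GO
  # annotation come back with an empty go_id and are skipped here.
  known_ids <- intersect(unique(go_sets$go_id), AnnotationDbi::keys(GO.db::GO.db))
  go_sets$definition <- NA_character_
  if (length(known_ids) > 0) {
    defs <- AnnotationDbi::select(GO.db::GO.db, keys = known_ids,
                                  columns = "DEFINITION", keytype = "GOID")
    go_sets$definition <- defs$DEFINITION[match(go_sets$go_id, defs$GOID)]
  }
  go_sets
}
```

It would be called as get_go_without_definitions(gsub("\\..*", "", row.names(res)), mart), with res and mart as in the original post.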
Is there any chance that such issues could be handled more gracefully in the future?
This is an unusual example in that the offending entry actually has an extra \n in the GO description. This is why the solution of changing the column order doesn't work: it just shifts the erroneous 'new row' around, but it will always be there.
The simplest solution might actually be to contact Ensembl and try to understand why there is a line break, and either remove it or escape it properly. Otherwise I'm not sure you can be 100% confident that a \n doesn't truly represent a new entry.
I've started an issue at https://github.com/grimbough/biomaRt/issues/16 and will keep track of any progress there.
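To illustrate the point about the embedded line break, here is a small offline sketch. The tab-separated text is made up for the demonstration (it is not real Ensembl output), and try_parse is a hypothetical helper; the read.table() arguments only approximate the call shown in the traceback.

```r
# Hypothetical helper: parse tab-separated lines roughly the way the traceback
# shows, returning the error message instead of stopping on failure.
try_parse <- function(lines) {
  tryCatch(
    read.table(text = lines, sep = "\t", header = TRUE, quote = "",
               comment.char = "", check.names = FALSE, stringsAsFactors = FALSE),
    error = function(e) conditionMessage(e)
  )
}

# Made-up response: the definition of GENE2 contains an unescaped line break,
# so one logical record is spread over two physical lines.
broken <- c(
  "gene_id\tgo_id\tdefinition",
  "GENE1\tGO:0000001\tA normal one-line definition",
  "GENE2\tGO:0000002\tA definition that is split",
  "over two lines",
  "GENE3\tGO:0000003\tAnother normal definition"
)

# Same data with the definition column first: the stray fragment still forms
# a line with the wrong number of fields.
reordered <- c(
  "definition\tgene_id\tgo_id",
  "A normal one-line definition\tGENE1\tGO:0000001",
  "A definition that is split",
  "over two lines\tGENE2\tGO:0000002",
  "Another normal definition\tGENE3\tGO:0000003"
)

# The intact version of the same table, for comparison.
clean <- broken[-4]

broken_result    <- try_parse(broken)     # error message (character)
reordered_result <- try_parse(reordered)  # error message (character)
clean_result     <- try_parse(clean)      # data.frame with 3 rows

print(broken_result)
print(reordered_result)
print(clean_result)
```

Both the original and the reordered layout fail in scan(), while the table without the stray line break parses normally, which is consistent with reordering columns not being a fix here.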