Hi all, I'm seeing a bit of an unusually error that I've never come across before - hoping for some assistance. I had posted to Biostars, recommendations were to put it up here.
I quantified pair-end RNA seq data using Salmon, loaded it into RStudio via tximeta and continued with the general Bioconductor DESeq2 pipeline structure. Everything is working as expected. However, when I ask one of my project students to rerun the analysis on their machine, he runs into the error:
gse <- summarizeToGene(se)
loading existing EnsDb created: 2020-09-21 10:50:18
obtaining transcript-to-gene mapping from database
loading existing gene ranges created: 2020-09-21 10:56:56
summarizing abundance
summarizing counts
summarizing length
Error in abundanceMatTx * lengthMatTx :
non-numeric argument to binary operator
I've never come across this error before. From the face of things, R is complaining that a non-numeric value eg. a name of something is being passed into a binary operator e.g., "+". I'm having the student double check the meta-data that makes up the prior call to se <- tximeta(colDataDF). However, one thought that came to mind were the quant.sf files generated by Salmon.
Due to Covid lockdown we're having to do all of this remotely and sadly, the project student doesn't have the bandwidth at home to download the RNA seq data and process the reads using Salmon. Therefore, he's using the quantified .sf files I generated, a fraction of the size. My initial concern is some machine-dependent association with the quant.sf files.
I noticed we had different versions of tximeta, so I brought my version up to 1.7.14. There are quite a few changes in the object attributes returned by summarizeToGene between my original version of tximeta (1.7.4) and the later version, however, my machine, can still execute summarizeToGene() without throwing an error.
We tried using tximport as an alternative to tximeta. The tximport::summarizeToGene() yields a valid object on my machine, however, fails with a similar error to the one above on my students machine.
This is my R setup.
SessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] KEGGgraph_1.49.1 gprofiler2_0.2.0 ReportingTools_2.29.0
[4] knitr_1.29 org.Hs.eg.db_3.11.4 AnnotationDbi_1.51.3
[7] RColorBrewer_1.1-2 pheatmap_1.0.12 genefilter_1.71.0
[10] magrittr_1.5 dplyr_1.0.1 GGally_2.0.0
[13] tidyr_1.1.2 broom_0.7.0 ggplot2_3.3.2
[16] vsn_3.57.0 DESeq2_1.29.13 tximeta_1.7.14
[19] SummarizedExperiment_1.19.6 DelayedArray_0.15.4 matrixStats_0.56.0
[22] Matrix_1.2-18 Biobase_2.49.0 GenomicRanges_1.41.5
[25] GenomeInfoDb_1.25.10 IRanges_2.23.10 S4Vectors_0.27.12
[28] BiocGenerics_0.35.4
loaded via a namespace (and not attached):
[1] backports_1.1.10 GOstats_2.55.0 Hmisc_4.4-1
[4] AnnotationHub_2.21.5 BiocFileCache_1.13.1 plyr_1.8.6
[7] lazyeval_0.2.2 GSEABase_1.51.1 splines_4.0.2
[10] BiocParallel_1.23.2 digest_0.6.25 ensembldb_2.13.1
[13] htmltools_0.5.0 GO.db_3.11.4 checkmate_2.0.0
[16] memoise_1.1.0 BSgenome_1.57.6 cluster_2.1.0
[19] limma_3.45.14 Biostrings_2.57.2 annotate_1.67.1
[22] R.utils_2.10.1 ggbio_1.37.0 askpass_1.1
[25] prettyunits_1.1.1 jpeg_0.1-8.1 colorspace_1.4-1
[28] blob_1.2.1 rappdirs_0.3.1 xfun_0.16
[31] crayon_1.3.4 RCurl_1.98-1.2 jsonlite_1.7.1
[34] tximport_1.17.6 graph_1.67.1 VariantAnnotation_1.35.3
[37] survival_3.2-3 glue_1.4.1 gtable_0.3.0
[40] zlibbioc_1.35.0 XVector_0.29.3 Rgraphviz_2.33.0
[43] scales_1.1.1 DBI_1.1.0 edgeR_3.31.4
[46] Rcpp_1.0.5 viridisLite_0.3.0 xtable_1.8-4
[49] progress_1.2.2 htmlTable_2.1.0 foreign_0.8-80
[52] bit_4.0.4 OrganismDbi_1.31.0 preprocessCore_1.51.0
[55] Formula_1.2-3 AnnotationForge_1.31.2 htmlwidgets_1.5.1
[58] httr_1.4.2 ellipsis_0.3.1 R.methodsS3_1.8.1
[61] pkgconfig_2.0.3 reshape_0.8.8 XML_3.99-0.5
[64] nnet_7.3-14 dbplyr_1.4.4 locfit_1.5-9.4
[67] tidyselect_1.1.0 rlang_0.4.7 reshape2_1.4.4
[70] later_1.1.0.1 munsell_0.5.0 BiocVersion_3.12.0
[73] tools_4.0.2 generics_0.0.2 RSQLite_2.2.0
[76] stringr_1.4.0 fastmap_1.0.1 yaml_2.2.1
[79] bit64_4.0.5 purrr_0.3.4 AnnotationFilter_1.13.0
[82] RBGL_1.65.0 mime_0.9 R.oo_1.24.0
[85] biomaRt_2.45.2 compiler_4.0.2 rstudioapi_0.11
[88] plotly_4.9.2.1 curl_4.3 png_0.1-7
[91] interactiveDisplayBase_1.27.5 affyio_1.59.0 PFAM.db_3.11.4
[94] tibble_3.0.3 geneplotter_1.67.0 stringi_1.5.3
[97] GenomicFeatures_1.41.3 lattice_0.20-41 ProtGenerics_1.21.0
[100] vctrs_0.3.2 pillar_1.4.6 lifecycle_0.2.0
[103] BiocManager_1.30.10 data.table_1.13.0 bitops_1.0-6
[106] httpuv_1.5.4 rtracklayer_1.49.5 hwriter_1.3.2
[109] R6_2.4.1 latticeExtra_0.6-29 affy_1.67.0
[112] promises_1.1.1 gridExtra_2.3 dichromat_2.0-0
[115] assertthat_0.2.1 openssl_1.4.3 Category_2.55.0
[118] withr_2.2.0 GenomicAlignments_1.25.3 Rsamtools_2.5.3
[121] GenomeInfoDbData_1.2.3 hms_0.5.3 grid_4.0.2
[124] rpart_4.1-15 biovizBase_1.37.0 shiny_1.5.0
[127] base64enc_0.1-3
This is the setup that fails to generate a summarizeToGene result:
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] tximeta_1.7.14 tximport_1.17.6 SummarizedExperiment_1.19.6
[4] DelayedArray_0.15.8 matrixStats_0.56.0 Matrix_1.2-18
[7] Biobase_2.49.1 GenomicRanges_1.41.6 GenomeInfoDb_1.25.11
[10] IRanges_2.23.10 S4Vectors_0.27.13 BiocGenerics_0.35.4
loaded via a namespace (and not attached):
[1] httr_1.4.2 jsonlite_1.7.1 bit64_4.0.5
[4] AnnotationHub_2.21.5 shiny_1.5.0 assertthat_0.2.1
[7] askpass_1.1 interactiveDisplayBase_1.27.5 BiocManager_1.30.10
[10] BiocFileCache_1.13.1 blob_1.2.1 Rsamtools_2.5.3
[13] GenomeInfoDbData_1.2.3 yaml_2.2.1 progress_1.2.2
[16] BiocVersion_3.12.0 pillar_1.4.6 RSQLite_2.2.0
[19] lattice_0.20-41 glue_1.4.2 digest_0.6.25
[22] promises_1.1.1 XVector_0.29.3 htmltools_0.5.0
[25] httpuv_1.5.4 XML_3.99-0.5 pkgconfig_2.0.3
[28] biomaRt_2.45.2 zlibbioc_1.35.0 purrr_0.3.4
[31] xtable_1.8-4 later_1.1.0.1 BiocParallel_1.23.2
[34] tibble_3.0.3 openssl_1.4.3 AnnotationFilter_1.13.0
[37] generics_0.0.2 ellipsis_0.3.1 GenomicFeatures_1.41.3
[40] lazyeval_0.2.2 magrittr_1.5 crayon_1.3.4
[43] mime_0.9 memoise_1.1.0 evaluate_0.14
[46] tools_4.0.2 prettyunits_1.1.1 hms_0.5.3
[49] lifecycle_0.2.0 stringr_1.4.0 ensembldb_2.13.1
[52] AnnotationDbi_1.51.3 Biostrings_2.57.2 compiler_4.0.2
[55] tinytex_0.25 rlang_0.4.7 grid_4.0.2
[58] RCurl_1.98-1.2 rstudioapi_0.11 rappdirs_0.3.1
[61] bitops_1.0-6 rmarkdown_2.3 DBI_1.1.0
[64] curl_4.3 R6_2.4.1 GenomicAlignments_1.25.3
[67] rtracklayer_1.49.5 knitr_1.29 dplyr_1.0.2
[70] fastmap_1.0.1 bit_4.0.4 ProtGenerics_1.21.0
[73] stringi_1.5.3 Rcpp_1.0.5 vctrs_0.3.4
[76] dbplyr_1.4.4 tidyselect_1.1.0 xfun_0.17
Suggestions are most welcome.
Any luck hunting down the bug from the database side?
Hi Mike, sorry to have left this hanging, I was waiting to hear back from the student after the weekend. It's very difficult getting to the bottom of this whilst we are all working remotely. I'm still waiting for the feedback from those commands you supplied, but in the mean while after he managed to get the raw reads (so no longer using the quant files I gave him), he's now seeing:
I have a feeling this is all tied together with the meta_info.json file in each quant folder and the EnsDb you mentioned. I'm just not sure how as I've never had to delve this deep into tximeta as it's always worked for me.
I've just got back the information required: This is my setup:
This is his:
Thanks for the update. The new error above indicates that samples are being imported that were run with different Salmon index (and so the transcripts will not align to build a matrix of count data). That may be a clue to the other error (or it could be independent).
You can do:
To see which files were quantified with a different set of transcripts according to its hash value. You can also look at:
To see the path of the index that was provided to Salmon.
Thanks very much Mike.
Had a look at what his terminal spits out. The checksum is identical to mine (we used the same reference transcriptome), and index reference is what it should be (pointing to the reference he used during the indexing stage of salmon). Would this suggest that this checksum entry is not in his distribution of Tximeta? <-- I'm looking at the figure in the tximeta html document.
Also, I don't know whether the reference transcriptome file name is of a standard i.e., a standard set by Ensembl. The file was renamed on his machine, and I'm seeing:
I'm reading that Ensembl transcripts sould be automatically recognised by tximeta. I'm just taking a stab at the dark here.
I've looked through the info.json of the reference index folder, the "indexseqhash" matches what's in his auxinfo/metainfo.json files. An interesting development, and I have no idea if it would make any difference. He ran salmon index a second from the seventh sample onwards. I've checked for difference in the metadata of those samples that came before and those that came after, I can't spot anything obvious.
So the error indicates that the reference transcripts on his machine are not identical across samples, which kind of breaks the import process (e.g. tximport/tximeta won't be able to merge to create a single object).
Can you explain this more:
This sounds like the source of an error that would break tximport/tximeta.
Indeed! Again, this is all remote some I'm trying to get to the bottom of it. I have a transcript of the commands used on the Mac terminal. From the looks of it, the first few samples he performed Salmon quant having first performed a salmon index. The machine was powered off. The next day it was powered on. An attempt was made at running salmon quant on the next sample but failed with:
I've never seen that error before. Anyway, that's when salmon index was run a second time. It ran, then the remaining samples were processed with salmon quant. I took a look at what files I could get hold of. The json files before and after the second salmon index instantiation look the same.
Update: It is most certainly the second execution of salmon index that broke his pipeline. Thank you Mike for being extremely patient and helpful.
Thanks for the report. I think that I can look into a better error message... for some reason the function got to summarization when it should have caught on the mis-matching index across samples first.
Thanks again. I should have diagnosed the problem a lot earlier, but with supervision being remote, technical support for new project students is really tough!