I'm running the RNA-seq gene-level workflow largely as described by Mike Love (http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html). It had been running just fine, but yesterday the workflow stopped at summarizeToGene with the following error:
"'rnames' exact pattern
'Mus_musculus.GRCm38.102.gtf.gz'
is not unique; use 'bfcquery()' to see matches.
Error in bfcrpath(bfc, txdbName) : not all 'rnames' found or unique."
This is after a fresh retrieval of metadata by tximeta. As I said - this had not caused any problems until yesterday, but I can't seem to solve it even after removing rids from the bfc cache.
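In case it helps anyone reproduce this, here is a minimal sketch of how the duplicated entries could be listed. The helper name `duplicated_rnames` and the toy table are made up for illustration; `bfcinfo()`, `bfcquery()`, and tximeta's `getTximetaBFC()` are the existing functions, shown only in comments because they need the real cache.

```r
# Hypothetical helper: keep only the cache rows whose 'rname' is shared by
# more than one entry. 'info' is a data frame with 'rid' and 'rname' columns,
# such as as.data.frame(BiocFileCache::bfcinfo(bfc)).
duplicated_rnames <- function(info) {
  info <- as.data.frame(info, stringsAsFactors = FALSE)
  shared <- unique(info$rname[duplicated(info$rname)])
  info[info$rname %in% shared, , drop = FALSE]
}

# Toy listing standing in for a real cache (rids are made up).
toy <- data.frame(
  rid   = c("BFC1", "BFC2", "BFC3"),
  rname = c("Mus_musculus.GRCm38.102.gtf.gz",
            "Mus_musculus.GRCm38.102.gtf.gz",
            "other_resource"),
  stringsAsFactors = FALSE
)
dups <- duplicated_rnames(toy)
print(dups)

# Against the real cache (assumes tximeta and BiocFileCache are installed):
#   bfc <- BiocFileCache::BiocFileCache(tximeta::getTximetaBFC())
#   duplicated_rnames(BiocFileCache::bfcinfo(bfc))
#   BiocFileCache::bfcquery(bfc, "Mus_musculus.GRCm38.102.gtf.gz",
#                           field = "rname", exact = TRUE)
```

If two rows come back for the GTF name, that would match what `bfcrpath()` is complaining about.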
Any thoughts? A paste from the .Rmd file is below:
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tximeta)
library(DESeq2)
library(tidyverse)
library("pheatmap")
library("RColorBrewer")
library("PoiClaClu")
library("org.Mm.eg.db")
library(BiocFileCache)
```
## RNA-seq analysis in G9a KO mice
## Trimmed the reads using TrimGalore!
`trim_galore -o Sample-2 Sample-2/Sample-2_1.fq.gz`
This probably isn't 100% necessary, as the reads are single-end and short, but I did it anyway.
## Quantifying counts
I'll use Salmon (v 1.3.0) to make the count table (https://combine-lab.github.io/salmon/)
Downloaded the transcriptome in FASTA format from Ensembl (ftp://ftp.ensembl.org/pub/release-101/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz)
As well as the gene structures in GTF format (ftp://ftp.ensembl.org/pub/release-101/gtf/mus_musculus/Mus_musculus.GRCm38.101.gtf.gz)
The first step - index the transcriptome:
`salmon index -t Mus_musculus.GRCm38.cdna.all.fa -i GRCm38_index`
Then map the reads:
`#!/bin/bash
# Quantify each trimmed sample against the Salmon index
for fn in Sample-{1..12}; do
  samp=$(basename "${fn}")
  echo "Processing sample ${samp}"
  salmon quant -i GRCm38_index -l A \
    -r "${samp}/${samp}_1_trimmed.fq.gz" \
    -p 4 --validateMappings --seqBias --gcBias --numBootstraps 50 \
    -o "quants/${samp}_quant"
done`
## Import data into R
We'll use tximeta.
```{r}
list.files(file.path("quants"))
files <- file.path("quants", paste0("Sample-", 1:12, "_quant"), "quant.sf")
names <- paste0("Sample_", 1:12)
cell <- factor(c("wt", "wt", "wt", "wt", "wt", "wt", "K182R", "K182R", "K182R", "K182R", "K182R", "K182R"), levels = c("wt","K182R"))
treat <- factor(c("PBS", "PBS", "PBS", "dex", "dex", "dex", "PBS", "PBS", "PBS", "dex", "dex", "dex"), levels = c("PBS", "dex"))
coldata <- tibble(files, names, cell, treat)
file.exists(coldata$files)
[1] "Sample-1_quant"  "Sample-10_quant" "Sample-11_quant" "Sample-12_quant"
[5] "Sample-2_quant"  "Sample-3_quant"  "Sample-4_quant"  "Sample-5_quant"
[9] "Sample-6_quant"  "Sample-7_quant"  "Sample-8_quant"  "Sample-9_quant"
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
se <- tximeta(coldata)
gse <- summarizeToGene(se)
Error in bfcrpath(bfc, txdbName) : not all 'rnames' found or unique.
```
A more detailed output of tximeta and summarizeToGene is below:
> se <- tximeta(coldata)
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12
tximeta needs a BiocFileCache directory to access and save TxDb objects.
Do you wish to use the default directory: '/home/dweebis/.cache/BiocFileCache'?
If not, a temporary directory that is specific to this R session will be used.
You can always change this directory later by running: setTximetaBFC()
Or enter [0] to exit and set this directory manually now.
This location can also be set by environmental variable TXIMETA_HUB_CACHE.
1: Yes (use default)
2: No (use temp)
Selection: 1
/home/dweebis/.cache/BiocFileCache
does not exist, create directory? (yes/no): yes
found matching transcriptome:
[ Ensembl - Mus musculus - release 102 ]
useHub=TRUE: checking for EnsDb via 'AnnotationHub'
snapshotDate(): 2020-10-27
did not find matching EnsDb via 'AnnotationHub'
building EnsDb with 'ensembldb' package
Importing GTF file ... trying URL 'ftp://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.gtf.gz'
Content type 'unknown' length 33443321 bytes (31.9 MB)
==================================================
OK
Processing metadata ... OK
Processing genes ...
Attribute availability:
o gene_id ... OK
o gene_name ... OK
o entrezid ... Nope
o gene_biotype ... OK
OK
Processing transcripts ...
Attribute availability:
o transcript_id ... OK
o gene_id ... OK
o transcript_biotype ... OK
OK
Processing exons ... OK
Processing chromosomes ... Fetch seqlengths from ensembl ... OK
Generating index ... OK
-------------
Verifying validity of the information in the database:
Checking transcripts ... OK
Checking exons ... Warning in if ((One <- nargs() == 1L) && !missing(from)) { :
closing unused connection 3 (ftp://ftp.ensembl.org/pub/release-102/mysql/)
OK
building TxDb with 'GenomicFeatures' package
Import genomic features from the file as a GRanges object ... trying URL 'ftp://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.gtf.gz'
Content type 'unknown' length 33443321 bytes (31.9 MB)
==================================================
OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... The "phase" metadata column contains non-NA values for features of type
stop_codon. This information was ignored.Warning: call dbDisconnect() when finished working with a connection
OK
generating transcript ranges
Warning: the annotation is missing some transcripts that were quantified.
176 out of 117135 txps were missing from GTF/GFF but were in the indexed FASTA.
(This occurs sometimes with Ensembl txps on haplotype chromosomes.)
In order to build a ranged SummarizedExperiment, these txps were removed.
To keep these txps, and to skip adding ranges, use skipMeta=TRUE
Example missing txps: [ENSMUST00000181375, ENSMUST00000214094, ENSMUST00000215103, ...]
> gse <- summarizeToGene(se)
Error in (function (x) : attempt to apply non-function
'rnames' exact pattern
'Mus_musculus.GRCm38.102.gtf.gz'
is not unique; use 'bfcquery()' to see matches.Error in bfcrpath(bfc, txdbName) : not all 'rnames' found or unique.
thx - Miles
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods
[9] base
other attached packages:
[1] BiocFileCache_1.14.0 dbplyr_2.0.0 org.Mm.eg.db_3.12.0
[4] AnnotationDbi_1.52.0 PoiClaClu_1.0.2.1 RColorBrewer_1.1-2
[7] pheatmap_1.0.12 forcats_0.5.0 stringr_1.4.0
[10] dplyr_1.0.2 purrr_0.3.4 readr_1.4.0
[13] tidyr_1.1.2 tibble_3.0.4 ggplot2_3.3.2
[16] tidyverse_1.3.0 DESeq2_1.30.0 SummarizedExperiment_1.20.0
[19] Biobase_2.50.0 MatrixGenerics_1.2.0 matrixStats_0.57.0
[22] GenomicRanges_1.42.0 GenomeInfoDb_1.26.1 IRanges_2.24.0
[25] S4Vectors_0.28.0 BiocGenerics_0.36.0 tximeta_1.8.2
loaded via a namespace (and not attached):
[1] colorspace_2.0-0 ellipsis_0.3.1
[3] XVector_0.30.0 fs_1.5.0
[5] rstudioapi_0.13 bit64_4.0.5
[7] fansi_0.4.1 interactiveDisplayBase_1.28.0
[9] lubridate_1.7.9.2 xml2_1.3.2
[11] splines_4.0.3 tximport_1.18.0
[13] geneplotter_1.68.0 knitr_1.30
[15] jsonlite_1.7.1 Rsamtools_2.6.0
[17] broom_0.7.2 annotate_1.68.0
[19] shiny_1.5.0 BiocManager_1.30.10
[21] compiler_4.0.3 httr_1.4.2
[23] backports_1.2.0 assertthat_0.2.1
[25] Matrix_1.2-18 fastmap_1.0.1
[27] lazyeval_0.2.2 cli_2.2.0
[29] later_1.1.0.1 htmltools_0.5.0
[31] prettyunits_1.1.1 tools_4.0.3
[33] gtable_0.3.0 glue_1.4.2
[35] GenomeInfoDbData_1.2.4 rappdirs_0.3.1
[37] tinytex_0.27 Rcpp_1.0.5
[39] cellranger_1.1.0 vctrs_0.3.5
[41] Biostrings_2.58.0 rtracklayer_1.50.0
[43] xfun_0.19 rvest_0.3.6
[45] mime_0.9 lifecycle_0.2.0
[47] ensembldb_2.14.0 XML_3.99-0.5
[49] AnnotationHub_2.22.0 zlibbioc_1.36.0
[51] scales_1.1.1 hms_0.5.3
[53] promises_1.1.1 ProtGenerics_1.22.0
[55] AnnotationFilter_1.14.0 yaml_2.2.1
[57] curl_4.3 memoise_1.1.0
[59] biomaRt_2.46.0 stringi_1.5.3
[61] RSQLite_2.2.1 BiocVersion_3.12.0
[63] genefilter_1.72.0 GenomicFeatures_1.42.1
[65] BiocParallel_1.24.1 rlang_0.4.9
[67] pkgconfig_2.0.3 bitops_1.0-6
[69] evaluate_0.14 lattice_0.20-41
[71] GenomicAlignments_1.26.0 bit_4.0.4
[73] tidyselect_1.1.0 magrittr_2.0.1
[75] R6_2.5.0 generics_0.1.0
[77] DelayedArray_0.16.0 DBI_1.1.0
[79] withr_2.3.0 pillar_1.4.7
[81] haven_2.3.1 survival_3.2-7
[83] RCurl_1.98-1.2 modelr_0.1.8
[85] crayon_1.3.4 rmarkdown_2.5
[87] progress_1.2.2 locfit_1.5-9.4
[89] grid_4.0.3 readxl_1.3.1
[91] blob_1.2.1 reprex_0.3.0
[93] digest_0.6.27 xtable_1.8-4
[95] httpuv_1.5.4 openssl_1.4.3
[97] munsell_0.5.0 askpass_1.1
Hello,
I face the same issue. I do not have a solution, but a workaround. It is certainly a terrible idea and nobody should do it. It will also be overwritten by a future tximeta update.
That said, find the file hashtable.csv under your R library path, in tximeta/extdata/hashtable.csv
Line 63 holds the information for the ENSEMBL GRCm38 genome, release 102. It has the same SHA256 checksum as the 101 release (and the 100 release, as well; I do not know if that is relevant). You performed your Salmon quantification with the release 101 (as did I), so my best guess is that these get mixed up somehow, but I have no idea about the specifics. Deleting that line allowed my pipeline to run as it did previously.
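To check the shared-checksum observation without reading the file by eye, something like the following sketch could be used. The helper `shared_checksums` is hypothetical, the toy table has made-up hashes, and the column names `sha256` and `release` are my assumption about the CSV header, so check `names(tab)` first.

```r
# Hypothetical helper: group the rows of a hashtable-like data frame by
# checksum and return only the checksums listed for more than one release.
shared_checksums <- function(tab, hash_col = "sha256", release_col = "release") {
  stopifnot(hash_col %in% names(tab), release_col %in% names(tab))
  by_hash <- split(as.character(tab[[release_col]]), tab[[hash_col]])
  by_hash[lengths(by_hash) > 1]
}

# Toy table with made-up checksums: releases 100-102 share one hash.
toy <- data.frame(
  release = c("100", "101", "102", "99"),
  sha256  = c("aaa", "aaa", "aaa", "bbb"),
  stringsAsFactors = FALSE
)
shared <- shared_checksums(toy)
print(shared)

# Against the installed file (column names are an assumption):
#   path <- system.file("extdata", "hashtable.csv", package = "tximeta")
#   tab  <- read.csv(path, stringsAsFactors = FALSE)
#   shared_checksums(tab)
```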
Good luck!
Thanks for the reply! That does seem like something I shouldn't do...but I may try it if I can't find a way to do it without editing tximeta.
Maybe this is more appropriate as a BiocFileCache issue?
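One route that might avoid editing package files is tximeta's `makeLinkedTxome()`, which links a Salmon index to a specific FASTA and GTF. The sketch below is untested against this bug. The wrapper name `link_release_101` is hypothetical, the index directory and URLs are the ones from the original post, and whether a linked transcriptome takes precedence over the built-in table in tximeta 1.8.2 is an assumption.

```r
# Sketch only: register the Salmon index against the release it was built
# from (101), so the match does not depend on the checksum lookup alone.
# Calling the function requires the tximeta package and the index directory.
link_release_101 <- function(index_dir = "GRCm38_index") {
  tximeta::makeLinkedTxome(
    indexDir = index_dir,
    source   = "Ensembl",
    organism = "Mus musculus",
    release  = "101",
    genome   = "GRCm38",
    fasta    = "ftp://ftp.ensembl.org/pub/release-101/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz",
    gtf      = "ftp://ftp.ensembl.org/pub/release-101/gtf/mus_musculus/Mus_musculus.GRCm38.101.gtf.gz",
    write    = FALSE  # do not write a JSON file alongside the index
  )
}
```

The intended order would be to call `link_release_101()` once, then rerun `tximeta(coldata)` and `summarizeToGene(se)`.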