Question

summarizeToGene from tximeta: not all 'rnames' found or unique.

1

Entering edit mode

Pufall, Miles A ▴ 30

@pufall-miles-a-5909

Last seen 16 months ago

United States

I'm running the RNA-seq gene-level workflow largely described by Mike Love (http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html). It had been running just fine, but yesterday the workflow stopped and summarizeToGene with the following error:

"rnames' exact pattern
    'Mus_musculus.GRCm38.102.gtf.gz'
  is not unique; use 'bfcquery()' to see matches.Error in bfcrpath(bfc, txdbName) : not all 'rnames' found or unique."

This is after a fresh retrieval of metadata by tximeta. As I said - this had not caused any problems until yesterday, but I can't seem to solve it even after removing rids from the bfc cache.

Any thoughts? a paste from the .Rmd file is below:

```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) library(tximeta) library(DESeq2) library(tidyverse) library("pheatmap") library("RColorBrewer") library("PoiClaClu") library("org.Mm.eg.db") library(BiocFileCache)


## RNA-seq analysis in G9a KO mice

## Trimmed the reads using TrimGalore!

`trim_galore -o Sample-2 Sample-2/Sample-2_1.fq.gz`

This probably isn't 100% necessary, as the reads are single and short, but did it anyway.

## Quantifying counts

I'll use Salmon (v 1.3.0) to make count table (https://combine-lab.github.io/salmon/)

Downloaded transcriptome in fasta format from ensembl (ftp://ftp.ensembl.org/pub/release-101/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz)

As well as the gene structures in GTF format (ftp://ftp.ensembl.org/pub/release-101/gtf/mus_musculus/Mus_musculus.GRCm38.101.gtf.gz)

The first step - index the transcriptome:
`salmon index -t Mus_musculus.GRCm38.cdna.all.fa -i GRCm38_index`

The map the reads:
`#!/bin/bash
for fn in Sample-{1..12};
do
samp=`basename ${fn}`
echo "Processing sample ${samp}"
salmon quant -i GRCm38_index  -l A \
        -r ${samp}/${samp}_1_trimmed.fq.gz \
        -p 4 --validateMappings --seqBias --gcBias --numBootstraps 50 -o quants/${samp}_quant
done`

## Import data into R

We'll use tximeta.

```{r}
list.files(file.path("quants"))
files <- file.path("quants", paste0("Sample-", 1:12, "_quant"), "quant.sf")
names <- paste0("Sample_", 1:12)
cell <- factor(c("wt", "wt", "wt", "wt", "wt", "wt", "K182R", "K182R", "K182R", "K182R", "K182R", "K182R"), levels = c("wt","K182R"))
treat <- factor(c("PBS", "PBS", "PBS", "dex", "dex", "dex", "PBS", "PBS", "PBS", "dex", "dex", "dex"), levels = c("PBS", "dex"))
coldata <- tibble(files, names, cell, treat)
file.exists(coldata$files)

[1] "Sample-1_quant" "Sample-10_quant" "Sample-11_quant" "Sample-12_quant" [5] "Sample-2_quant" "Sample-3_quant" "Sample-4_quant" "Sample-5_quant" [9] "Sample-6_quant" "Sample-7_quant" "Sample-8_quant" "Sample-9_quant" [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

se <- tximeta(coldata)
gse <- summarizeToGene(se)

Error in bfcrpath(bfc, txdbName) : not all 'rnames' found or unique.

A more detailed output of the tximeta and smmarizeToGene is below:

> se <- tximeta(coldata)
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 

tximeta needs a BiocFileCache directory to access and save TxDb objects.
Do you wish to use the default directory: '/home/dweebis/.cache/BiocFileCache'?
If not, a temporary directory that is specific to this R session will be used.

You can always change this directory later by running: setTximetaBFC()
Or enter [0] to exit and set this directory manually now.
This location can also be set by environmental variable TXIMETA_HUB_CACHE. 

1: Yes (use default)
2: No (use temp)

Selection: 1
/home/dweebis/.cache/BiocFileCache
  does not exist, create directory? (yes/no): yes
found matching transcriptome:
[ Ensembl - Mus musculus - release 102 ]
useHub=TRUE: checking for EnsDb via 'AnnotationHub'
snapshotDate(): 2020-10-27
did not find matching EnsDb via 'AnnotationHub'
building EnsDb with 'ensembldb' package
Importing GTF file ... trying URL 'ftp://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.gtf.gz'
Content type 'unknown' length 33443321 bytes (31.9 MB)
==================================================
OK
Processing metadata ... OK
Processing genes ... 
 Attribute availability:
  o gene_id ... OK
  o gene_name ... OK
  o entrezid ... Nope
  o gene_biotype ... OK
OK
Processing transcripts ... 
 Attribute availability:
  o transcript_id ... OK
  o gene_id ... OK
  o transcript_biotype ... OK
OK
Processing exons ... OK
Processing chromosomes ... Fetch seqlengths from ensembl ... OK
Generating index ... OK
  -------------
Verifying validity of the information in the database:
Checking transcripts ... OK
Checking exons ... Warning in if ((One <- nargs() == 1L) && !missing(from)) { :
  closing unused connection 3 (ftp://ftp.ensembl.org/pub/release-102/mysql/)
OK
building TxDb with 'GenomicFeatures' package
Import genomic features from the file as a GRanges object ... trying URL 'ftp://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.gtf.gz'
Content type 'unknown' length 33443321 bytes (31.9 MB)
==================================================
OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.Warning: call dbDisconnect() when finished working with a connection
OK
generating transcript ranges


Warning: the annotation is missing some transcripts that were quantified.
176 out of 117135 txps were missing from GTF/GFF but were in the indexed FASTA.
(This occurs sometimes with Ensembl txps on haplotype chromosomes.)
In order to build a ranged SummarizedExperiment, these txps were removed.
To keep these txps, and to skip adding ranges, use skipMeta=TRUE

Example missing txps: [ENSMUST00000181375, ENSMUST00000214094, ENSMUST00000215103, ...]
> gse <- summarizeToGene(se)
Error in (function (x)  : attempt to apply non-function
'rnames' exact pattern
    'Mus_musculus.GRCm38.102.gtf.gz'
  is not unique; use 'bfcquery()' to see matches.Error in bfcrpath(bfc, txdbName) : not all 'rnames' found or unique.

thx - Miles

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] BiocFileCache_1.14.0        dbplyr_2.0.0                org.Mm.eg.db_3.12.0        
 [4] AnnotationDbi_1.52.0        PoiClaClu_1.0.2.1           RColorBrewer_1.1-2         
 [7] pheatmap_1.0.12             forcats_0.5.0               stringr_1.4.0              
[10] dplyr_1.0.2                 purrr_0.3.4                 readr_1.4.0                
[13] tidyr_1.1.2                 tibble_3.0.4                ggplot2_3.3.2              
[16] tidyverse_1.3.0             DESeq2_1.30.0               SummarizedExperiment_1.20.0
[19] Biobase_2.50.0              MatrixGenerics_1.2.0        matrixStats_0.57.0         
[22] GenomicRanges_1.42.0        GenomeInfoDb_1.26.1         IRanges_2.24.0             
[25] S4Vectors_0.28.0            BiocGenerics_0.36.0         tximeta_1.8.2              

loaded via a namespace (and not attached):
 [1] colorspace_2.0-0              ellipsis_0.3.1               
 [3] XVector_0.30.0                fs_1.5.0                     
 [5] rstudioapi_0.13               bit64_4.0.5                  
 [7] fansi_0.4.1                   interactiveDisplayBase_1.28.0
 [9] lubridate_1.7.9.2             xml2_1.3.2                   
[11] splines_4.0.3                 tximport_1.18.0              
[13] geneplotter_1.68.0            knitr_1.30                   
[15] jsonlite_1.7.1                Rsamtools_2.6.0              
[17] broom_0.7.2                   annotate_1.68.0              
[19] shiny_1.5.0                   BiocManager_1.30.10          
[21] compiler_4.0.3                httr_1.4.2                   
[23] backports_1.2.0               assertthat_0.2.1             
[25] Matrix_1.2-18                 fastmap_1.0.1                
[27] lazyeval_0.2.2                cli_2.2.0                    
[29] later_1.1.0.1                 htmltools_0.5.0              
[31] prettyunits_1.1.1             tools_4.0.3                  
[33] gtable_0.3.0                  glue_1.4.2                   
[35] GenomeInfoDbData_1.2.4        rappdirs_0.3.1               
[37] tinytex_0.27                  Rcpp_1.0.5                   
[39] cellranger_1.1.0              vctrs_0.3.5                  
[41] Biostrings_2.58.0             rtracklayer_1.50.0           
[43] xfun_0.19                     rvest_0.3.6                  
[45] mime_0.9                      lifecycle_0.2.0              
[47] ensembldb_2.14.0              XML_3.99-0.5                 
[49] AnnotationHub_2.22.0          zlibbioc_1.36.0              
[51] scales_1.1.1                  hms_0.5.3                    
[53] promises_1.1.1                ProtGenerics_1.22.0          
[55] AnnotationFilter_1.14.0       yaml_2.2.1                   
[57] curl_4.3                      memoise_1.1.0                
[59] biomaRt_2.46.0                stringi_1.5.3                
[61] RSQLite_2.2.1                 BiocVersion_3.12.0           
[63] genefilter_1.72.0             GenomicFeatures_1.42.1       
[65] BiocParallel_1.24.1           rlang_0.4.9                  
[67] pkgconfig_2.0.3               bitops_1.0-6                 
[69] evaluate_0.14                 lattice_0.20-41              
[71] GenomicAlignments_1.26.0      bit_4.0.4                    
[73] tidyselect_1.1.0              magrittr_2.0.1               
[75] R6_2.5.0                      generics_0.1.0               
[77] DelayedArray_0.16.0           DBI_1.1.0                    
[79] withr_2.3.0                   pillar_1.4.7                 
[81] haven_2.3.1                   survival_3.2-7               
[83] RCurl_1.98-1.2                modelr_0.1.8                 
[85] crayon_1.3.4                  rmarkdown_2.5                
[87] progress_1.2.2                locfit_1.5-9.4               
[89] grid_4.0.3                    readxl_1.3.1                 
[91] blob_1.2.1                    reprex_0.3.0                 
[93] digest_0.6.27                 xtable_1.8-4                 
[95] httpuv_1.5.4                  openssl_1.4.3                
[97] munsell_0.5.0                 askpass_1.1

RNA-seq tximeta • 2.8k views

ADD COMMENT • link 4.2 years ago Pufall, Miles A ▴ 30

0

Entering edit mode

Hello,

I face the same issue. I do not have a solution, but a workaround. It is certainly a terrible idea and nobody should do it. It will also be overwritten by a future tximeta update.

That said, find your R library path/tximeta/extdata/hashtable.csv

Line 63 holds the information for the ENSEMBL GRCm38 genome, release 102. It has the same SHA256 checksum as the 101 release (and the 100 release, as well; I do not know if that is relevant). You performed your Salmon quantification with the release 101 (as did I), so my best guess is that these get mixed up somehow, but I have no idea about the specifics. Deleting that line allowed my pipeline to run as it did previously.

Good luck!

ADD REPLY • link 4.2 years ago OT • 0

0

Entering edit mode

Thanks for the reply! That does seem like something I shouldn't do...but I may try it if I can't find a way to do it without editing tximeta.

ADD REPLY • link 4.2 years ago Pufall, Miles A ▴ 30

0

Entering edit mode

Maybe this is more appropriate as a BiocFileCache issue?

ADD REPLY • link 4.2 years ago Pufall, Miles A ▴ 30

score 1 · Answer 1 · 2020-12-15

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 22 hours ago

United States

Dear Miles,

Thanks for the bug report and sorry for the delay in response. This skipped my inbox for some reason.

Can you show me:

library(BiocFileCache)
bfc <- BiocFileCache()
as.data.frame( (bfcquery(bfc, "Mus_musculus.GRCm38.102.gtf.gz") )

ADD COMMENT • link 4.2 years ago Michael Love 43k

0

Entering edit mode

I've quantified some mouse data on my end with Ensembl release 102, and I was able to reproduce the bug wherein two entries are stored in the BFC, which leads to this error. This shouldn't ever happen under normal usage so the error was a good thing for helping catch the problem. Tximeta shouldn't add a database to the BFC if it already exists there with the same name.

I'll work on this today.

ADD REPLY • link 4.2 years ago Michael Love 43k

1

Entering edit mode

I've hopefully fixed this in v1.9.3 and v1.8.3 with this commit.

Duplicate entries were being added to the BFC because of a bug in how I was checking cases, introduced in the current release.

In ~2 days this code will be available via BiocManager::install, or you can test it immediately with install_github("mikelove/tximeta") (at the moment, release and devel branch are the same so this shouldn't break anything).

You will need to bfcremove the rids associated with Mus_musculus.GRCm38.102.gtf.gz before trying the new code. You can bfcquery as above and then do, e.g.:

bfcremove(bfc, c("BFC123","BFC234"))

ADD REPLY • link 4.2 years ago Michael Love 43k

0

Entering edit mode

Fantastic - thanks for the help, and I'll report back!

ADD REPLY • link 4.2 years ago Pufall, Miles A ▴ 30

0

Entering edit mode

I update to tximeta from your gitthub, removed previous rids, and everything is running smoothly again. Thanks!

ADD REPLY • link 4.2 years ago Pufall, Miles A ▴ 30

0

Entering edit mode

Sorry for the delay - I'm also not getting email notification of a comment. You've answered this below - but here's the output:

library(BiocFileCache) Loading required package: dbplyr

bfc <- BiocFileCache()

as.data.frame( (bfcquery(bfc, "Mus_musculus.GRCm38.102.gtf.gz") )

)

rid rname create_time access_time

1 BFC1 Mus_musculus.GRCm38.102.gtf.gz 2020-12-06 22:50:44 2020-12-06 22:50:44

2 BFC2 Mus_musculus.GRCm38.102.gtf.gz 2020-12-06 22:52:26 2020-12-06 22:52:26

3 BFC3 txpRngs-Mus_musculus.GRCm38.102.gtf.gz 2020-12-06 22:52:27 2020-12-06 22:52:27 rpath rtype 1 /Users/milespufall/Library/Caches/BiocFileCache/54305640c910_54305640c910.sqlite relative 2 /Users/milespufall/Library/Caches/BiocFileCache/54301fed178a_54301fed178a.sqlite relative 3 /Users/milespufall/Library/Caches/BiocFileCache/54306a47566_54306a47566.rds relative

fpath last_modified_time etag expires

1 54305640c910 NA <NA> NA

2 54301fed178a NA <NA> NA

3 54306a47566 NA <NA> NA

ADD REPLY • link 4.2 years ago Pufall, Miles A ▴ 30