tximport: Error in medianLengthOverIsoform
2
0
Entering edit mode
d-joester ▴ 20
@d-joester-19665
Last seen 5.7 years ago

Hi,

I am very new to RNAseq, bioconductor, and R. So far, I have successfully established my workflow for DGE of my data, using Salmon, to create transcript-level count tables; tximport, and DESeq2 to aggregate and perform gene level analysis. The following code works as expected:

samples <- read.table(file.path("somepath","samples.txt"), header=TRUE)
rownames(samples) <- samples$run

files <- file.path("somepath", paste(samples$run,"quant",sep="_"), "quant.sf")
names(files) <- samples$run

tx2gene <- read_csv(file.path("somepath", "tx2gene.csv")) 

txi <- tximport(files, type = "salmon", tx2gene = tx2gene)

ddsTxi <- DESeqDataSetFromTximport(txi,
                                   colData = samples,
                                   design = ~ condition)

dds <- ddsTxi[rowSums(counts(ddsTxi)) >= 1,]
dds <- DESeq(dds)

I am now trying to run a DTU analysis. Following a published workflow rnaseqDTU, I used

txi <- tximport(files, type="salmon", txOut=TRUE,
            countsFromAbundance="scaledTPM")

to import the Salmon data instead. This also seems to work fine. However, in the documentation for tximport, I found a reference to using the option countsFromAbundance="dtuScaledTPM", which requires also passing the tx2gene dataframe. When I use

txi <- tximport(files, type="salmon", txOut=TRUE, countsFromAbundance="dtuScaledTPM",tx2gene=tx2gene)

instead, I get the following output:

reading in files with read_tsv

1 2 3 4 5 6 7 8 9 10 11 12 Error in medianLengthOverIsoform(length4CFA, tx2gene, ignoreTxVersion, :
all(txId %in% tx2gene$tx) is not TRUE

I'd appreciate some help figuring out what is going on here. It sounds like there is a set of transcripts (txId) that is not fully contained in the set that I provided in tx2gene. I suspect that txId is derived from the quant.sf files generated using Salmon, but it would help to be certain. It would be even better if there was a way to identify the missing transcripts.

On a more general level, I did not find much information regarding when to use scaledTPM and when to use dtuScaledTPM as options (for DTU analysis). Any suggestions are welcome.

Thank you very much,

Derk

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS  10.14.2

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Users/derk/anaconda3/envs/salmon/lib/R/lib/libRblas.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] fission_1.2.0               apeglm_1.4.2                PoiClaClu_1.0.2.1          
 [4] RColorBrewer_1.1-2          pheatmap_1.0.12             hexbin_1.27.2              
 [7] ggplot2_3.1.0               dplyr_0.7.8                 tximportData_1.10.0        
[10] readr_1.3.1                 tximport_1.10.1             DESeq2_1.22.2              
[13] SummarizedExperiment_1.12.0 DelayedArray_0.8.0          BiocParallel_1.16.5        
[16] matrixStats_0.54.0          Biobase_2.42.0              GenomicRanges_1.34.0       
[19] GenomeInfoDb_1.18.1         IRanges_2.16.0              S4Vectors_0.20.1           
[22] BiocGenerics_0.28.0        

loaded via a namespace (and not attached):
 [1] bitops_1.0-6           bit64_0.9-7            numDeriv_2016.8-1      tools_3.5.1           
 [5] backports_1.1.3        utf8_1.1.4             R6_2.3.0               rpart_4.1-13          
 [9] Hmisc_4.2-0            DBI_1.0.0              lazyeval_0.2.1         colorspace_1.4-0      
[13] nnet_7.3-12            withr_2.1.2            tidyselect_0.2.5       gridExtra_2.3         
[17] bit_1.1-14             compiler_3.5.1         cli_1.0.1              htmlTable_1.13.1      
[21] labeling_0.3           scales_1.0.0           checkmate_1.9.1        genefilter_1.64.0     
[25] stringr_1.3.1          digest_0.6.18          foreign_0.8-71         XVector_0.22.0        
[29] base64enc_0.1-3        pkgconfig_2.0.2        htmltools_0.3.6        bbmle_1.0.20          
[33] htmlwidgets_1.3        rlang_0.3.1            rstudioapi_0.9.0       RSQLite_2.1.1         
[37] bindr_0.1.1            jsonlite_1.6           acepack_1.4.1          RCurl_1.95-4.11       
[41] magrittr_1.5           GenomeInfoDbData_1.2.0 Formula_1.2-3          Matrix_1.2-15         
[45] Rcpp_1.0.0             munsell_0.5.0          fansi_0.4.0            stringi_1.2.4         
[49] yaml_2.2.0             MASS_7.3-51.1          zlibbioc_1.28.0        plyr_1.8.4            
[53] grid_3.5.1             blob_1.1.1             crayon_1.3.4           lattice_0.20-38       
[57] splines_3.5.1          annotate_1.60.0        hms_0.4.2              locfit_1.5-9.1        
[61] knitr_1.21             pillar_1.3.1           geneplotter_1.60.0     reshape2_1.4.3        
[65] XML_3.98-1.16          glue_1.3.0             latticeExtra_0.6-28    data.table_1.12.0     
[69] BiocManager_1.30.4     gtable_0.2.0           purrr_0.3.0            assertthat_0.2.0      
[73] emdbook_1.3.10         xfun_0.4               xtable_1.8-3           coda_0.19-2           
[77] survival_2.43-3        tibble_2.0.1           AnnotationDbi_1.44.0   memoise_1.1.0         
[81] bindrcpp_0.2.2         cluster_2.0.7-1
tximport deseq2 dexseq • 2.1k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 5 days ago
United States

It's required that you provide the gene ID for all of the transcripts in the Salmon quants. And it has to be exact, or you have to use some of the arguments that specify what should be stripped from the Salmon transcript ID to make it match your tx2gene table. Can you show tx2gene? And what transcripts did you use with Salmon?

ADD COMMENT
0
Entering edit mode

Michael,

thanks for your quick response. I used Salmon to create an index of the reference transcriptome for S. purpuratus hosted on Echinobase (link. My tx2name dataframe looks like this:

> head(tx2gene)

# A tibble: 6 x 3
  TXNAME         GENEID       ntx        
  <chr>          <chr>        <S3: table>
1 WHL22.324130.0 WHL22.324130 1          
2 WHL22.324131.1 WHL22.324131 2          
3 WHL22.224727.1 WHL22.224727 2          
4 WHL22.100119.0 WHL22.100119 1          
5 WHL22.100252.0 WHL22.100252 1          
6 WHL22.100256.0 WHL22.100256 1

I poked around my files in the meantime and realized that for some reason 3 transcripts went missing from my csv file. I added these back in and things started working again. I'd still be glad if you could comment on the use case for countsFromAbundance="dtuScaledTPM" vs countsFromAbundance="scaledTPM".

In the meantime, I encountered another issue further downstream, with stageR. I'll put this in a different question.

Thank you again, Derk

ADD REPLY
0
Entering edit mode

I didn’t benchmark dtuScaledTPM but added to tximport to allow for later benchmarking. I think the technical description is in ?tximport. dtuScaledTPM should be closer to the original counts than scaledTPM which is desirable for information about the precision to propagate to statistical methods.

ADD REPLY
0
Entering edit mode
d-joester ▴ 20
@d-joester-19665
Last seen 5.7 years ago

Poking around, I found that

setdiff(rownames(cts),tx2gene$TXNAME)

revealed that tx2gene was missing three transcripts. I added these to my csv file and the issue disappeared.

ADD COMMENT

Login before adding your answer.

Traffic: 499 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6