Hi,
I am very new to RNAseq, bioconductor, and R. So far, I have successfully established my workflow for DGE of my data, using Salmon, to create transcript-level count tables; tximport, and DESeq2 to aggregate and perform gene level analysis. The following code works as expected:
samples <- read.table(file.path("somepath","samples.txt"), header=TRUE)
rownames(samples) <- samples$run
files <- file.path("somepath", paste(samples$run,"quant",sep="_"), "quant.sf")
names(files) <- samples$run
tx2gene <- read_csv(file.path("somepath", "tx2gene.csv"))
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
ddsTxi <- DESeqDataSetFromTximport(txi,
colData = samples,
design = ~ condition)
dds <- ddsTxi[rowSums(counts(ddsTxi)) >= 1,]
dds <- DESeq(dds)
I am now trying to run a DTU analysis. Following a published workflow rnaseqDTU, I used
txi <- tximport(files, type="salmon", txOut=TRUE,
countsFromAbundance="scaledTPM")
to import the Salmon data instead. This also seems to work fine. However, in the documentation for tximport, I found a reference to using the option countsFromAbundance="dtuScaledTPM", which requires also passing the tx2gene dataframe. When I use
txi <- tximport(files, type="salmon", txOut=TRUE, countsFromAbundance="dtuScaledTPM",tx2gene=tx2gene)
instead, I get the following output:
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 Error in medianLengthOverIsoform(length4CFA, tx2gene, ignoreTxVersion, :
all(txId %in% tx2gene$tx) is not TRUE
I'd appreciate some help figuring out what is going on here. It sounds like there is a set of transcripts (txId) that is not fully contained in the set that I provided in tx2gene. I suspect that txId is derived from the quant.sf files generated using Salmon, but it would help to be certain. It would be even better if there was a way to identify the missing transcripts.
On a more general level, I did not find much information regarding when to use scaledTPM and when to use dtuScaledTPM as options (for DTU analysis). Any suggestions are welcome.
Thank you very much,
Derk
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS 10.14.2
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Users/derk/anaconda3/envs/salmon/lib/R/lib/libRblas.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] fission_1.2.0 apeglm_1.4.2 PoiClaClu_1.0.2.1
[4] RColorBrewer_1.1-2 pheatmap_1.0.12 hexbin_1.27.2
[7] ggplot2_3.1.0 dplyr_0.7.8 tximportData_1.10.0
[10] readr_1.3.1 tximport_1.10.1 DESeq2_1.22.2
[13] SummarizedExperiment_1.12.0 DelayedArray_0.8.0 BiocParallel_1.16.5
[16] matrixStats_0.54.0 Biobase_2.42.0 GenomicRanges_1.34.0
[19] GenomeInfoDb_1.18.1 IRanges_2.16.0 S4Vectors_0.20.1
[22] BiocGenerics_0.28.0
loaded via a namespace (and not attached):
[1] bitops_1.0-6 bit64_0.9-7 numDeriv_2016.8-1 tools_3.5.1
[5] backports_1.1.3 utf8_1.1.4 R6_2.3.0 rpart_4.1-13
[9] Hmisc_4.2-0 DBI_1.0.0 lazyeval_0.2.1 colorspace_1.4-0
[13] nnet_7.3-12 withr_2.1.2 tidyselect_0.2.5 gridExtra_2.3
[17] bit_1.1-14 compiler_3.5.1 cli_1.0.1 htmlTable_1.13.1
[21] labeling_0.3 scales_1.0.0 checkmate_1.9.1 genefilter_1.64.0
[25] stringr_1.3.1 digest_0.6.18 foreign_0.8-71 XVector_0.22.0
[29] base64enc_0.1-3 pkgconfig_2.0.2 htmltools_0.3.6 bbmle_1.0.20
[33] htmlwidgets_1.3 rlang_0.3.1 rstudioapi_0.9.0 RSQLite_2.1.1
[37] bindr_0.1.1 jsonlite_1.6 acepack_1.4.1 RCurl_1.95-4.11
[41] magrittr_1.5 GenomeInfoDbData_1.2.0 Formula_1.2-3 Matrix_1.2-15
[45] Rcpp_1.0.0 munsell_0.5.0 fansi_0.4.0 stringi_1.2.4
[49] yaml_2.2.0 MASS_7.3-51.1 zlibbioc_1.28.0 plyr_1.8.4
[53] grid_3.5.1 blob_1.1.1 crayon_1.3.4 lattice_0.20-38
[57] splines_3.5.1 annotate_1.60.0 hms_0.4.2 locfit_1.5-9.1
[61] knitr_1.21 pillar_1.3.1 geneplotter_1.60.0 reshape2_1.4.3
[65] XML_3.98-1.16 glue_1.3.0 latticeExtra_0.6-28 data.table_1.12.0
[69] BiocManager_1.30.4 gtable_0.2.0 purrr_0.3.0 assertthat_0.2.0
[73] emdbook_1.3.10 xfun_0.4 xtable_1.8-3 coda_0.19-2
[77] survival_2.43-3 tibble_2.0.1 AnnotationDbi_1.44.0 memoise_1.1.0
[81] bindrcpp_0.2.2 cluster_2.0.7-1
Michael,
thanks for your quick response. I used Salmon to create an index of the reference transcriptome for S. purpuratus hosted on Echinobase (link. My tx2name dataframe looks like this:
I poked around my files in the meantime and realized that for some reason 3 transcripts went missing from my csv file. I added these back in and things started working again. I'd still be glad if you could comment on the use case for
countsFromAbundance="dtuScaledTPM"
vscountsFromAbundance="scaledTPM"
.In the meantime, I encountered another issue further downstream, with stageR. I'll put this in a different question.
Thank you again, Derk
I didn’t benchmark dtuScaledTPM but added to tximport to allow for later benchmarking. I think the technical description is in ?tximport. dtuScaledTPM should be closer to the original counts than scaledTPM which is desirable for information about the precision to propagate to statistical methods.