Question

Using tximport on rsem files produces error while trying to read in files

1

Entering edit mode

stephan.emmrich ▴ 10

@stephanemmrich-15373

Last seen 6.7 years ago

Hi,

I am trying to import RSEM output files into tximport. Using this code

sTPM_hsa <- tximport(files = rsem_files, type = "rsem", txIn = TRUE, txOut = FALSE, countsFromAbundance = "scaledTPM")

with

rsem_files <- list.files(sTPM_basedir, pattern = "genes", full.names = TRUE)
rsem_files <- rsem_files[file.info(rsem_files)$isdir]
rsem_files <- paste0(rsem_files, "/sTPM.sf")
rsem_files <- rsem_files[file.exists(rsem_files)]
names(rsem_files) <- basename(gsub("/sTPM.sf", "", rsem_files))

I get this:

Error in if (is.null(importer) & !kallisto.h5) { :
missing value where TRUE/FALSE needed

That points to line 168 in tximport.R, but I don"t get the problem?! I have the readr package installed, also even if not it should use another function to read in the files...

I could not find this issue in any related post, hopefully its not a too stupid mistake...

Appreciate any help!

Thank You,

Stephan

R tximport rnaseq tpm rsem • 3.8k views

ADD COMMENT • link 6.7 years ago stephan.emmrich ▴ 10

0

Entering edit mode

I have 1.6. just downloaded today from Bioconductor

ADD REPLY • link 6.7 years ago stephan.emmrich ▴ 10

0

Entering edit mode

Thanks!

When I do this, then

> install_github(“mikelove/tximport”)
Error: unexpected input in "install_github(“"

I am using RStudio1.1.442 with R3.4.4 on a windows 7 PC, could windows be the problem here?

ADD REPLY • link 6.7 years ago stephan.emmrich ▴ 10

0

Entering edit mode

It may be curly quotes? Try typing it out instead of copy paste

ADD REPLY • link 6.7 years ago Michael Love 43k

Michael Love · Answer 1 · 2018-03-29

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 1 day ago

United States

What version of tximport do you have?

ADD COMMENT • link 6.7 years ago Michael Love 43k

0

Entering edit mode

Can you try this and see if it fixes the error:

library(devtools)

install_github(“mikelove/tximport”)

Then restart R

ADD REPLY • link 6.7 years ago Michael Love 43k

0

Entering edit mode

Great, that works! Now I have tximport 1.7.13.

But sadly that gives me same error as from the original post with that commands:

Error in if (is.null(importer) & !kallisto.h5) { : 
  missing value where TRUE/FALSE needed

Weirdly, when I run a sample to try out tximport v.1.7.13:

sTPM_hsa_TEST <- tximport(files = "./sTPM/Blueprint/C000WYB3.gene_quantification.rsem_grape2_crg.GRCh38.20150622.RESULTS", type = "rsem", countsFromAbundance = "scaledTPM")

it gives this error:

reading in files with read_tsv
1 Warning: 120966 parsing failures.
row # A tibble: 5 x 5 col     row col    expected               actual     file                                                            expected   <int> <chr>  <chr>                  <chr>      <chr>                                                           actual 1     1 length no trailing characters .10        './sTPM/Blueprint/C000WYB3.gene_quantification.rsem_grape2_crg~ file 2     1 NA     8 columns              15 columns './sTPM/Blueprint/C000WYB3.gene_quantification.rsem_grape2_crg~ row 3     2 length no trailing characters .50        './sTPM/Blueprint/C000WYB3.gene_quantification.rsem_grape2_crg~ col 4     2 NA     8 columns              15 columns './sTPM/Blueprint/C000WYB3.gene_quantification.rsem_grape2_crg~ expected

But when I run that same sample code to try out tximport v.1.6.00, then the output has a "no" for every row in the countsFromAbundance column!?

So the simple one-sample tximport run with a local path gets a parsing error or does not make scaledTPMs, depending on version. The complex tximport run with multiple files gets stuck before anything is even read in. Maybe a problem with my variables or pathnames? Here again everything I typed for the multiple files run from your paper:

basedir <- "E:/work/Dropbox/Bioware/Rochester/HSC/RNA Seq/R"
sTPM_basedir <- paste0(basedir, "/sTPM/Blueprint")

rsem_files <- list.files(sTPM_basedir, pattern = "genes", full.names = TRUE)
rsem_files <- rsem_files[file.info(rsem_files)$isdir]
rsem_files <- paste0(rsem_files, "/sTPM.sf")
rsem_files <- rsem_files[file.exists(rsem_files)]
names(rsem_files) <- basename(gsub("/sTPM.sf", "", rsem_files))

sTPM_hsa <- tximport(files = rsem_files, type = "rsem", txIn = TRUE, txOut = FALSE, countsFromAbundance = "scaledTPM", readLength = 100)

Error in if (is.null(importer) & !kallisto.h5) { : 
  missing value where TRUE/FALSE needed

ADD REPLY • link 6.7 years ago stephan.emmrich ▴ 10

0

Entering edit mode

I’m a bit lost on how this error is coming up. Tell me more about these files. They are isoform level results from RSEM (because you have txIn=TRUE)?

ADD REPLY • link 6.7 years ago Michael Love 43k

0

Entering edit mode

The files are from Blueprint DCC portal, primary analysis done by Grape2 pipeline (https://github.com/guigolab/grape-nf).

This is the header and first 5 rows of such a file (total 30).

gene_id	transcript_id(s)	length	effective_length	expected_count	TPM	FPKM	posterior_mean_count	posterior_standard_deviation_of_count	pme_TPM	pme_FPKM	TPM_ci_lower_bound	TPM_ci_upper_bound	FPKM_ci_lower_bound	FPKM_ci_upper_bound
ENSG00000000003.13	ENST00000373020.7,ENST00000494424.1,ENST00000496771.4,ENST00000612152.3,ENST00000614008.3	2074.1	1837.69	885	3.65	4.35	885	0	3.64	4.38	3.33516	3.96303	4.02217	4.77881
ENSG00000000005.5	ENST00000373031.4,ENST00000485971.1	940.5	704.26	0	0	0	0	0	0.03	0.04	0.000306	0.08152	0.000426	0.098299
ENSG00000000419.11	ENST00000371582.7,ENST00000371584.7,ENST00000371588.8,ENST00000413082.1,ENST00000466152.4,ENST00000494752.1	1079.47	843.07	5697	51.15	61.01	5697	0	50.66	61.02	49.2778	51.9333	59.392	62.5898
ENSG00000000457.12	ENST00000367770.4,ENST00000367771.9,ENST00000367772.7,ENST00000423670.1,ENST00000470238.1	1538	1301.6	16	0.09	0.11	16.01	0.11	0.11	0.13	0.06088	0.154207	0.074289	0.186764
ENSG00000000460.15	ENST00000286031.9,ENST00000359326.7,ENST00000413811.3,ENST00000459772.4,ENST00000466580.5,ENST00000472795.4,ENST00000481744.4,ENST00000496973.4,ENST00000498289.4	2066.69	1830.32	659	2.73	3.25	658.99	0.11	2.78	3.35	2.40965	3.16032	2.90657	3.81134
ENSG00000000938.11	ENST00000374003.6,ENST00000374004.4,ENST00000374005.6,ENST00000399173.4,ENST00000457296.4,ENST00000468038.1,ENST00000475472.4	2729	2492.6	4	0.01	0.01	4	0	0.07	0.08	0.023157	0.115701	0.027981	0.139542

Actually I am not interested in isoform quantitation so far, as I want to compare to two other species, with gene symbol as common identifier, Due to multimapped reads fro the non-model organism isoform comparisions would be difficult to interprete. I just took the code from the paper, when I tried to run it as simple minimal version i just called for "type" and "countsFromAbundance"

ADD REPLY • link 6.6 years ago stephan.emmrich ▴ 10

0

Entering edit mode

Hmm, I think the problem may be that these are not in RSEM format. Due to other code problems, I've had to hard code the parsing of each column, and so I can only support files output directly by RSEM.

I don't think tximport() is useful for you here, and I'd recommend just reading in the individual files and building a gene count matrix manually. You can round() this and provide it to DESeqDataSetFromMatrix (for DESeq2) or other packages like edgeR and limma can take fractional counts directly.

ADD REPLY • link 6.6 years ago Michael Love 43k

0

Entering edit mode

Ok, so here is an output i directly created by using our own sequenced data:

gene_id	transcript_id(s)	length	effective_length	expected_count	TPM	FPKM
gene:ENSG00000000003_TSPAN6	transcript:ENST00000373020_TSPAN6-201,transcript:ENST00000612152_TSPAN6-204,transcript:ENST00000614008_TSPAN6-205	2583.71	2355.23	154	0.92	1
gene:ENSG00000000005_TNMD	transcript:ENST00000373031_TNMD-201	1339	1110.52	0	0	0
gene:ENSG00000000419_DPM1	transcript:ENST00000371582_DPM1-201,transcript:ENST00000371584_DPM1-202,transcript:ENST00000371588_DPM1-203,transcript:ENST00000413082_DPM1-204	1084.35	855.87	3106	51.33	55.27
gene:ENSG00000000457_SCYL3	transcript:ENST00000367770_SCYL3-201,transcript:ENST00000367771_SCYL3-202,transcript:ENST00000367772_SCYL3-203,transcript:ENST00000423670_SCYL3-204	4358.65	4130.17	1455.18	4.98	5.37
gene:ENSG00000000460_C1orf112	transcript:ENST00000286031_C1orf112-201,transcript:ENST00000359326_C1orf112-202,transcript:ENST00000413811_C1orf112-203,transcript:ENST00000459772_C1orf112-204,transcript:ENST00000466580_C1orf112-205,transcript:ENST00000472795_C1orf112-206,transcript:ENST00000481744_C1orf112-207,transcript:ENST00000496973_C1orf112-208	3127.66	2899.18	1603.82	7.82	8.43

only difference here is that it didn't do the posterior mean and calc-ci option for the additional columns compared to the Blueprint sample above.

This is the same data format?

I wanted to use tximport as to have a library-size normalized TPM metrics for downstream analysis. If I understand you correctly, for my purpose of comparing genes not isoforms this is not necessary because edgeR/Glimma can provide depth-normalization based on counts and this I could use on the TPMs from RSEM?

ADD REPLY • link 6.6 years ago stephan.emmrich ▴ 10

0

Entering edit mode

I think we're confusing some things here. TPM are already library size corrected. But TPM are not ideal for computing differential expression, because they have thrown out information about the precision of the abundance estimate. You can provide estimated counts to DESeq2, edgeR or limma and they will correct for library size. RSEM computes gene-level estimated counts the same way that tximport does, so I think you'll perhaps avoid more problems by assembling these matrices yourself.

ADD REPLY • link 6.6 years ago Michael Love 43k

0

Entering edit mode

Ok, I will do that then. Thanks a lot for your help, very educating!!

One thing about TPMs though: If I would want to compare between gene abundances in one sample-set vs another (e.g. one species vs the other), then I would choose TPMs, right? And log2 transform them for downstream DGE analysis.

Now, when RSEM provides effective counts by abundance estimation (expectation maximization they call it in their paper Haas BJ et al, 2013 Nat Protocols), then the RSEM TPMs should include that already, or am I confusing sth again? If so, then RSEM TPMs would be scaledTPMs?

ADD REPLY • link 6.6 years ago stephan.emmrich ▴ 10

0

Entering edit mode

Here's my recommendation, because I don't know if I can answer all of your questions, but I do want to help guide you to the right analysis: import gene-level estimated counts from your files (avoiding tximport only because you have some difficulty reading in these files). Pass this matrix of counts to your choice of DGE method in Bioconductor. You can work with the TPMs if you want, but then I don't have any software or analysis recommendations for you, because the methods I am familiar with make use of counts.

ADD REPLY • link 6.6 years ago Michael Love 43k

0

Entering edit mode

I just updated tximport to v1.7.14 on GitHub with new error testing code. Can you try this version?

ADD REPLY • link 6.7 years ago Michael Love 43k

0

Entering edit mode

Interestingly, now I get

Error: length(files) > 0 is not TRUE

I read that this is a common issue with RSEM and shouldn't this in your script already take care of it?

if (type == "rsem") {
# protect against 0 bp length transcripts
txi$length[txi$length < 1] <- 1

ADD REPLY • link updated 6.6 years ago by Michael Love 43k • written 6.6 years ago by stephan.emmrich ▴ 10

0

Entering edit mode

Good! I suspected this might be the issue so added a bit of code to protect you and future users from confusing error messages. The problem is that the first argument of tximport() needs to have at least 1 file. You are providing an empty vector.

ADD REPLY • link 6.6 years ago Michael Love 43k

0

Entering edit mode

Hi ,I have this length(files) > 0 is not TRUE. I don't understand this last response. I paste my code here, could you let me know if you have any idea how to fix this:

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("tximport")
library(tximport)
samples <- list.files(path = "home/data/Continuum_ChIPSeq/R_txtimport", full.names = T, pattern="\\.salmon$")
files <- file.path(samples, "quant.sf")
tx2gene <- read.delim("tx2gene_grch38_ens94.txt")
head(tx2gene)
txi <- tximport(files, type="salmon", tx2gene=tx2gene[,c("tx_id", "ensgene")], countsFromAbundance="lengthScaledTPM")

ADD REPLY • link updated 5.2 years ago by Michael Love 43k • written 5.2 years ago by jason.williams ▴ 10

1

Entering edit mode

The error means that your files vector is empty. Try things out in your R session and you may be able to debug the issue yourself:

E.g. if R is telling you that the files vector is empty, take a look at the files vector by printing it:

> files

What do you see? And if this is indeed an empty vector, take a look at the line where you construct files. Maybe there is something about the samples object, so take a look at that:

> samples

etc.

ADD REPLY • link 5.2 years ago Michael Love 43k