Question

failed to use tximport for creating count matrix from kallisto

0

Entering edit mode

Yijing • 0

@d0a8eb95

Last seen 18 months ago

Germany

Hello,

I am trying to use tximport to extract counts data generated by kallisto.

First I create the transcript-geneID file by:

library(EnsDb.Rnorvegicus.v79)
edb <- EnsDb.Rnorvegicus.v79
Tx <- transcripts(edb,
                  columns = c(listColumns(edb, "tx"), "gene_name"),
                  filter = TxBiotypeFilter("nonsense_mediated_decay"),
                  return.type = "DataFrame")

Then I use tximport:

files <- file.path(dir, samples$Out, "abundance.tsv")
library(tximport)
txi.kallisto.tsv <- tximport(files, type = "kallisto", tx2gene = Tx, dropInfReps=TRUE, ignoreTxVersion = TRUE)

But the result is wired. It is not counts for each gene, but seems counts for all the genes:

> txi.kallisto.tsv <- tximport(files, type = "kallisto", tx2gene = Tx, dropInfReps=TRUE, ignoreTxVersion = TRUE)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 2 3 4 
transcripts missing from tx2gene: 46826
summarizing abundance
summarizing counts
summarizing length
> head(txi.kallisto.tsv$counts)
                            [,1]     [,2]     [,3]     [,4]
nonsense_mediated_decay 857.8022 1422.593 909.1382 425.9061

Does anyone know what is the problem?

Many thanks!

tximport kallisto • 1.7k views

ADD COMMENT • link 23 months ago Yijing • 0

score 1 · Answer 1 · 2023-05-02

1

Entering edit mode

ATpoint ★ 4.8k

@atpoint-13662

Last seen 13 hours ago

Germany

transcripts missing from tx2gene: 46826

Make sure that the transcript identifiers in the fasta file that was used for the indexing with kallisto is the same as in the tx2gene table.

ADD COMMENT • link 23 months ago ATpoint ★ 4.8k

1

Entering edit mode

Also note that the input for tx2gene should be a two-column data.frame (linking transcript id (column 1) to gene id (column 2); the column names are not relevant, but this column order must be used). See ?tximport.

Your current input contains way more than 2 columns!

In addition: please be aware that you filtered the transcript mapping file to only include transcripts that are annotated as nonsense_mediated_decay. Personally I would not apply filtering at this phase of reading in the data but rather just before starting the differential expression analysis.

ADD REPLY • link 23 months ago Guido Hooiveld ★ 4.1k

0

Entering edit mode

Hi Guido, that's the problem: Your current input contains way more than 2 columns! Many thanks for your help. I solve it.

By the way, although I turned off the filter, there are still 27366 missing. Is it normal?

> txi.kallisto.tsv <- tximport(files, type = "kallisto", tx2gene = tx2gene, dropInfReps=TRUE, ignoreTxVersion = TRUE)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 2 3 4 
transcripts missing from tx2gene: 27366
summarizing abundance
summarizing counts
summarizing length

ADD REPLY • link 23 months ago Yijing • 0

1

Entering edit mode

No, that is not OK. The message indicates that for 27k transcripts you have count data available, but these transcripts are not in the transcript-to-gene mapping file. These transcripts are then discarded from analyses.

I suspect that the reason for this is that the versions of your transcript mapping file (the FASTA you used) and the Ensembl annotation (EnsDb.Rnorvegicus.v79) do not match; ensembl 80 was released in May 2015 (so v79 even few months earlier), and the FASTA you used likely much later. This was also hinted at by ATpoint.

Note that nowadays you can obtain ensembl-based annotation libraries through the so-called AnnotationHub; see e.g. this thread (Where can I find EnsDb.Hsapiens.v105?), including the link in my reply (EnsDb.Rnorvegicus for Rnor6) for some code to get you started with this.

ADD REPLY • link 23 months ago Guido Hooiveld ★ 4.1k

0

Entering edit mode

Hi Guido, many thanks. Yes you are right, I used the wrong version. Thank you for your solution. I also tried to use the same GTF file which I used for kallisto aligner, but there are still 69 missing.

gtf <- rtracklayer::import("genome.gtf")
tx2gene <- unique(data.frame(gtf[gtf$type=="exon"])[,c("transcript_id", "gene_id")])
head(tx2gene)
# Results ----
> txi.kallisto.tsv <- tximport(files, type = "kallisto", tx2gene = tx2gene, dropInfReps=TRUE, ignoreTxVersion = TRUE)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 2 3 4 
transcripts missing from tx2gene: 69
summarizing abundance
summarizing counts
summarizing length

I wonder how many missing transcripts are acceptable?

ADD REPLY • link 23 months ago Yijing • 0

0

Entering edit mode

How would you define 'acceptable'?

Personally I would only be satisfied if there are no missing transcripts at all in the mapping file (that is the reason you really need to know which version of ensembl the GTF/GFF/FASTA files you used correspond to, so you could match these with the correct EnsDb), but you could also argue 69 is a very limited number when compared to all transcripts you have mapping data for...

ADD REPLY • link 23 months ago Guido Hooiveld ★ 4.1k

0

Entering edit mode

Thank you for your reply! But in this case, I have used the same GTF file that I used in kallisto for alignment and there are still 69 missing. What is the reason for the missing and how to solve it?

ADD REPLY • link 23 months ago Yijing • 0