Entering edit mode
Hello,
I am trying to use tximport to extract counts data generated by kallisto.
First I create the transcript-geneID file by:
library(EnsDb.Rnorvegicus.v79)
edb <- EnsDb.Rnorvegicus.v79
Tx <- transcripts(edb,
columns = c(listColumns(edb, "tx"), "gene_name"),
filter = TxBiotypeFilter("nonsense_mediated_decay"),
return.type = "DataFrame")
Then I use tximport:
files <- file.path(dir, samples$Out, "abundance.tsv")
library(tximport)
txi.kallisto.tsv <- tximport(files, type = "kallisto", tx2gene = Tx, dropInfReps=TRUE, ignoreTxVersion = TRUE)
But the result is wired. It is not counts for each gene, but seems counts for all the genes:
> txi.kallisto.tsv <- tximport(files, type = "kallisto", tx2gene = Tx, dropInfReps=TRUE, ignoreTxVersion = TRUE)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 2 3 4
transcripts missing from tx2gene: 46826
summarizing abundance
summarizing counts
summarizing length
> head(txi.kallisto.tsv$counts)
[,1] [,2] [,3] [,4]
nonsense_mediated_decay 857.8022 1422.593 909.1382 425.9061
Does anyone know what is the problem?
Many thanks!
Also note that the input for
tx2gene
should be a two-column data.frame (linking transcript id (column 1) to gene id (column 2); the column names are not relevant, but this column order must be used). See?tximport
.Your current input contains way more than 2 columns!
In addition: please be aware that you filtered the transcript mapping file to only include transcripts that are annotated as
nonsense_mediated_decay
. Personally I would not apply filtering at this phase of reading in the data but rather just before starting the differential expression analysis.Hi Guido, that's the problem: Your current input contains way more than 2 columns! Many thanks for your help. I solve it.
By the way, although I turned off the filter, there are still 27366 missing. Is it normal?
No, that is not OK. The message indicates that for 27k transcripts you have count data available, but these transcripts are not in the transcript-to-gene mapping file. These transcripts are then discarded from analyses.
I suspect that the reason for this is that the versions of your transcript mapping file (the FASTA you used) and the Ensembl annotation (
EnsDb.Rnorvegicus.v79
) do not match; ensembl 80 was released in May 2015 (so v79 even few months earlier), and the FASTA you used likely much later. This was also hinted at by ATpoint.Note that nowadays you can obtain ensembl-based annotation libraries through the so-called
AnnotationHub
; see e.g. this thread (Where can I find EnsDb.Hsapiens.v105?), including the link in my reply (EnsDb.Rnorvegicus for Rnor6) for some code to get you started with this.Hi Guido, many thanks. Yes you are right, I used the wrong version. Thank you for your solution. I also tried to use the same GTF file which I used for kallisto aligner, but there are still 69 missing.
I wonder how many missing transcripts are acceptable?
How would you define 'acceptable'?
Personally I would only be satisfied if there are no missing transcripts at all in the mapping file (that is the reason you really need to know which version of ensembl the GTF/GFF/FASTA files you used correspond to, so you could match these with the correct EnsDb), but you could also argue 69 is a very limited number when compared to all transcripts you have mapping data for...
Thank you for your reply! But in this case, I have used the same GTF file that I used in kallisto for alignment and there are still 69 missing. What is the reason for the missing and how to solve it?