Question

Using tximport for kallisto aligned TOIL data

Nicholas • 0

@3611f731

United States

I am attempting to do differential gene expression analysis on kallisto-aligned data from the TOIL project. I want to use tximport to summarize the transcript-level data to the gene level. The abundance and count files are each a matrix with ENST transcript IDs as rows and sample names as columns. How can I use tximport to summarize these transcripts to the gene level, given that the data is not in the classic kallisto format (one output directory per sample)? If it is not possible to use tximport, how should I summarize the transcript IDs to gene names?

kallisto tximport TOIL • 544 views

6 weeks ago • updated 4 weeks ago • Nicholas • 0

recount2 does not use kallisto, does it? And recount does offer gene-level counts, so what's the point?

6 weeks ago • ATpoint ★ 4.8k

Sorry, I misspoke. The data was from the TOIL project and was aligned using kallisto. It was accessed from the UCSC XENA browser. I updated my post with these corrections.

6 weeks ago • Nicholas • 0

score 0 · Answer 1 · 2025-03-10

If you just have the transcript data, you probably don't need (and maybe cannot use) tximport. Instead you could just do the naive thing and average the transcripts that belong to each gene. You will first have to construct a data.frame that has transcripts in one column and genes in another. This isn't trivial because Ensembl is always updating things, so unless you know the transcriptome version used, you will have to iterate through various EnsDb versions on the AnnotationHub in order to find the one that matches most of your transcript IDs. Here is some semi-fake code to illustrate what I mean:

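library(AnnotationHub)
library(ensembldb)   ## keys() and select() methods for EnsDb objects
library(limma)       ## avereps()
## 'toil' below stands for your TOIL matrix (ENST IDs as row names, samples as columns)
hub <- AnnotationHub()
z <- query(hub, c("homo sapiens","ensdb"))
## Now let's assume that the most recent one is AH123456 (it's not - this is fake code after all)
ensdb <- hub[["AH123456"]]
## fraction of the TOIL transcripts that exist in this Ensembl version
mean(rownames(toil) %in% keys(ensdb, "TXID"))
## keep doing that with different versions of Ensembl until you get to a sufficiently high percentage
## of transcripts, where 'sufficiently high' is up to you
mapper <- select(ensdb, keys = rownames(toil), columns = c("TXID", "GENEID"), keytype = "TXID")
## the mapping may not have one row per transcript, so line the gene IDs up with the matrix rows by name
gene_ids <- mapper$GENEID[match(rownames(toil), mapper$TXID)]
keep <- !is.na(gene_ids)
averaged_data <- avereps(toil[keep, , drop = FALSE], ID = gene_ids[keep])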
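
To make the matching and collapsing steps concrete, here is a minimal base R sketch on made-up data. Everything in it is an assumption for illustration: the toy matrix `toil`, the hypothetical mapping table `tx2gene` (standing in for what `select()` on an EnsDb would give you), and the fake IDs. It also assumes the transcript IDs might carry a version suffix such as `.3`, which would need stripping before they can match unversioned Ensembl IDs; check your own row names to see whether that applies.

```r
# Toy stand-in for the TOIL matrix: transcripts in rows, samples in columns.
# All IDs and values are invented for illustration.
toil <- matrix(
  c(10, 20,
    30, 40,
     5,  5,
     1,  2),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("ENST0001.3", "ENST0002.1", "ENST0003.7", "ENST9999.1"),
    c("sampleA", "sampleB")
  )
)

# Hypothetical transcript-to-gene table (the role played by 'mapper' above).
tx2gene <- data.frame(
  TXID   = c("ENST0001", "ENST0002", "ENST0003"),
  GENEID = c("ENSG0001", "ENSG0001", "ENSG0002"),
  stringsAsFactors = FALSE
)

# Drop a trailing ".<digits>" version suffix, if there is one.
tx_ids <- sub("\\.[0-9]+$", "", rownames(toil))

# Fraction of transcripts found in this annotation; compare this number
# across annotation versions to pick the best-matching one.
matched_fraction <- mean(tx_ids %in% tx2gene$TXID)

# Gene ID for every row of the matrix, NA where the transcript is unknown.
gene_ids <- tx2gene$GENEID[match(tx_ids, tx2gene$TXID)]
keep <- !is.na(gene_ids)

# Sum the transcripts of each gene (rows come back sorted by gene ID).
gene_sum <- rowsum(toil[keep, , drop = FALSE], group = gene_ids[keep])

# Per-gene average: divide each gene's sum by its number of transcripts.
n_tx <- as.vector(table(gene_ids[keep])[rownames(gene_sum)])
gene_mean <- gene_sum / n_tx
```

Whether to sum or average depends on what you feed into the downstream analysis: for estimated counts, summing the transcripts of a gene is the more usual choice, while `gene_mean` mirrors the averaging done with `avereps()` above. If the matrix you downloaded is log-transformed, it would need to be back-transformed before summing.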