I am attempting to do differential gene expression analysis on kallisto-aligned data from the TOIL project. I want to use tximport to summarize the transcript-level data to the gene level. The abundance and count files are each a single matrix with ENST transcript IDs as rows and sample names as columns. How can I use tximport to summarize these transcripts to the gene level, given that the data are not in the classic kallisto format (one output directory per sample)? If it is not possible to use tximport, how should I summarize the transcript IDs to gene names?
Sorry, I misspoke. The data was from the TOIL project and was aligned using kallisto. It was accessed from the UCSC XENA browser. I updated my post with these corrections.
If all you have is the transcript-level matrix, you probably don't need (and maybe cannot use) tximport. Instead you could just do the naive thing and average the transcripts that belong to each gene. You will first have to construct a data.frame that has transcripts in one column and genes in another. This isn't trivial because Ensembl is always updating things, so unless you know the transcriptome version used, you will have to iterate through various EnsDb versions on the AnnotationHub in order to find the right one. Here is some semi-fake code to illustrate what I mean:
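library(AnnotationHub)
library(ensembldb)  # keys() and select() methods for EnsDb objects
library(limma)
## toil_data is a placeholder for your matrix (ENST IDs as row names, samples as columns)
hub <- AnnotationHub()
z <- query(hub, c("homo sapiens","ensdb"))  # print z to see the available EnsDb versions
## Now let's assume that the most recent one is AH123456 (it's not - this is fake code after all)
ensdb <- hub[["AH123456"]]
## fraction of the TOIL transcript IDs that exist in this Ensembl version
mean(rownames(toil_data) %in% keys(ensdb, "TXID"))
## keep doing that with different versions of Ensembl until you get to a sufficiently high percentage
## of transcripts, where 'sufficiently high' is up to you
mapper <- select(ensdb, rownames(toil_data), "GENEID", "TXID")
## select() can drop or reorder transcripts, so line the gene IDs up with the matrix rows explicitly
gene_ids <- mapper$GENEID[match(rownames(toil_data), mapper$TXID)]
keep <- !is.na(gene_ids)  # transcripts with no gene in this Ensembl version are left out
## average by gene ID (not by transcript ID, which would leave the matrix unchanged)
averaged_data <- avereps(toil_data[keep, ], ID = gene_ids[keep])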
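If you want to see what the averaging step does without downloading anything, here is a small base-R sketch on toy data. Everything in it is made up for illustration: the matrix `toy`, the `tx2gene` table (standing in for what `select()` would return), and all the IDs. It also assumes your row names might carry version suffixes such as `.2`, which would have to be stripped before they can match unversioned Ensembl TXIDs; check your own row names before relying on that.

```r
## Toy stand-in for the TOIL matrix: transcripts in rows, samples in columns.
## All IDs and values are invented for illustration.
toy <- matrix(c(10, 20, 30, 40,
                 2,  4,  6,  8,
                 5,  5,  5,  5),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("ENST0001.1", "ENST0002.3", "ENST0003.2"),
                              c("s1", "s2", "s3", "s4")))

## Hypothetical transcript-to-gene table, shaped like the output of select().
tx2gene <- data.frame(TXID   = c("ENST0001", "ENST0002", "ENST0003"),
                      GENEID = c("ENSG0001", "ENSG0001", "ENSG0002"),
                      stringsAsFactors = FALSE)

## Strip a trailing version suffix (".1", ".3", ...) from the row names.
tx_ids <- sub("\\.[0-9]+$", "", rownames(toy))

## Align gene IDs with the matrix rows; unmapped transcripts become NA.
gene_ids <- tx2gene$GENEID[match(tx_ids, tx2gene$TXID)]
keep <- !is.na(gene_ids)

## Per-gene sums, then divide by the number of transcripts per gene for means.
gene_sum  <- rowsum(toy[keep, , drop = FALSE], group = gene_ids[keep])
n_tx      <- as.vector(table(gene_ids[keep])[rownames(gene_sum)])
gene_mean <- gene_sum / n_tx
```

`gene_mean` is the same kind of per-gene average that `avereps()` computes; `gene_sum` is kept as well in case per-gene totals suit your data better than means.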
recount2 does not use kallisto, does it? And recount does offer gene-level counts, so what's the point?