Hi all,
I ran into an edge case situation of kallisto not processing GENCODE transcript identifiers correctly, and this currently propagates into tximport. Ideally this should be fixed upstream in kallisto, but we should harden tximport against this situation.
Here's an example kallisto run aligned against GENCODE that is problematic: https://share.steinbaugh.com/kallisto-gencode-dmso.tar.gz
Can we add a step to check for identifiers still incorrectly pipe (|
) delimited?
target_id length eff_length est_counts tpm
ENST00000456328.2|ENSG00000290825.1|-|OTTHUMT00000362751.1|DDX11L2-202|DDX11L2|1657|lncRNA| 1657 1431.94 16.5617 0.57248
Here's a rough draft:
library(tximport)
files <- sort(list.files(
path = "kallisto-gencode",
pattern = "abundance.h5",
full.names = TRUE,
recursive = TRUE
))
names(files) <- make.names(basename(dirname(files)))
txi <- tximport(
files = files,
type = "kallisto",
txIn = TRUE,
txOut = TRUE,
ignoreTxVersion = FALSE
)
extractGencodeIds <- function(x) {
if (!all(grepl(pattern = "|", x = x, fixed = TRUE))) {
return(x)
}
mat <- do.call(
what = rbind,
args = strsplit(x = x, split = "|", fixed = TRUE)
)
mat[, 1L]
}
rownames(txi[["abundance"]]) <- extractGencodeIds(rownames(txi[["abundance"]]))
rownames(txi[["counts"]]) <- extractGencodeIds(rownames(txi[["counts"]]))
rownames(txi[["length"]]) <- extractGencodeIds(rownames(txi[["length"]]))
See related:
- https://www.biostars.org/p/419605/
- https://groups.google.com/g/kallisto-and-applications/c/KQ8782UD35E/m/hbqqMOgGBwAJ
Paging Michael Love
Best, Mike
Thanks for the update Mike, I'll see if we can fix this upstream in kallisto. salmon has a similar mode with the
--gencode
flag to handle this situation with the FASTA file.PS here's a modified version of the code above that works using
sub
instead: