I have been working with a RNAseq dataset and wanted to change the names of my GENEID during tximport so I could match it with a dataframe in a downstream analysis. However, if I change the GENEID all of a sudden DESeq2 says I have 30200 genes rather than 25900. I thought tximport only is matching the transcript IDs between the gtf file and the quant.sf file?
Just to explain the code below:
- I get my gff3 file from JGI Phytozome database for Panicum hallii var. hallii v2.1 (Hall's panicgrass) and convert it to a gtf file
- The transcript ID does not match my quant.sf file so I remove the ".v2.1" in the TXNAME
- The discrepancy only occurs when I change the GENEID and nothing else
Here is the code:
TxDb = makeTxDbFromGFF("PhalliiHAL_annot.gtf", format = "gtf")
k <- keys(TxDb, keytype = "GENEID")
df <- ensembldb::select(TxDb, keys = k, keytype = "GENEID", columns = "TXNAME")
tx2gene <- df[, 2:1]
head(tx2gene) #should have 2 columns with transcript ID as the first column
#because the ".v2.1" won't match with the other file, I need to remove these for both gene and transcript
tx2gene$TXNAME = as.character(tx2gene$TXNAME) #make sure it's a character vector
TXNAME = as.vector(tx2gene$TXNAME)
GENEID = as.vector(tx2gene$GENEID)
n <- 3 # keep first 3 fields only
tx2gene$TXNAME = do.call(paste, c(read.table(text = TXNAME, sep = ".")[1:n], sep = "."))
tx2gene$GENEID = do.call(paste, c(read.table(text = TXNAME, sep = ".")[1:n], sep = "."))
samples = read.table("samples.csv", header =TRUE, sep = ",")
files = file.path("PhalliiHAL_quants", samples$sample, "quant.sf")
names(files) <- samples$sample[1:42]
all(file.exists(files))txi = tximport(files, type ="salmon", tx2gene = tx2gene)