Question

Salmon/DESeq2: tx2gene discrepancy in DESeq object

0

Entering edit mode

gmchaput • 0

@gmchaput-22280

Last seen 2.6 years ago

United States

I have been working with a RNAseq dataset and wanted to change the names of my GENEID during tximport so I could match it with a dataframe in a downstream analysis. However, if I change the GENEID all of a sudden DESeq2 says I have 30200 genes rather than 25900. I thought tximport only is matching the transcript IDs between the gtf file and the quant.sf file?

Just to explain the code below:

I get my gff3 file from JGI Phytozome database for Panicum hallii var. hallii v2.1 (Hall's panicgrass) and convert it to a gtf file
The transcript ID does not match my quant.sf file so I remove the ".v2.1" in the TXNAME
The discrepancy only occurs when I change the GENEID and nothing else

Here is the code:

TxDb = makeTxDbFromGFF("PhalliiHAL_annot.gtf", format = "gtf")
k <- keys(TxDb, keytype = "GENEID")
df <- ensembldb::select(TxDb, keys = k, keytype = "GENEID", columns = "TXNAME")
tx2gene <- df[, 2:1]
head(tx2gene) #should have 2 columns with transcript ID as the first column
#because the ".v2.1" won't match with the other file, I need to remove these for both gene and transcript
tx2gene$TXNAME = as.character(tx2gene$TXNAME) #make sure it's a character vector
TXNAME = as.vector(tx2gene$TXNAME)
GENEID = as.vector(tx2gene$GENEID)
n <- 3 # keep first 3 fields only
tx2gene$TXNAME = do.call(paste, c(read.table(text = TXNAME, sep = ".")[1:n], sep = "."))
tx2gene$GENEID = do.call(paste, c(read.table(text = TXNAME, sep = ".")[1:n], sep = "."))
samples = read.table("samples.csv", header =TRUE, sep = ",")
files = file.path("PhalliiHAL_quants", samples$sample, "quant.sf") 
names(files) <- samples$sample[1:42]
all(file.exists(files))txi = tximport(files, type ="salmon", tx2gene = tx2gene)

deseq2 tximport salmon • 1.4k views

ADD COMMENT • link updated 5.2 years ago by Michael Love 43k • written 5.2 years ago by gmchaput • 0

score 0 · Answer 1 · 2020-01-31

Yes, tximport is matching the transcript IDs in the quantification files with column 1 of tx2gene.

To debug what's going on, I'd recommend to first import data with tximport using the table that is created from the GFF (with only modifications necessary to match the transcript IDs, e.g. stripping version numbers in the GFF which are not present in FASTA).

Then as a second step, you could rename the gene IDs.

Separating into two steps will help make it easier for you and others to know what's happening at each step.