Question

quant.sf files and tximport, transcripts not recognized

Merlin ▴ 10

@merlin-15723

Last seen 4.9 years ago

Vancouver

Hello Folks,

I generated quant.sf files with Salmon, and the next step is to import the transcript abundances with tximport. I generated file.csv from the same annotation file that I used with Salmon:

> head(tx2gene)

             TXNAME            GENEID
1 ENST00000456328.2 ENSG00000223972.4
2 ENST00000515242.2 ENSG00000223972.4
3 ENST00000518655.2 ENSG00000223972.4
4 ENST00000450305.2 ENSG00000223972.4
5 ENST00000473358.1 ENSG00000243485.2
6 ENST00000469289.1 ENSG00000243485.2

Here is the output from a quant.sf file,

cat quant.sf | head -n 3
Name    Length  EffectiveLength TPM     NumReads
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|    1657    1513.346        0.000000        0.000
ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene|       632     488.811 17.921214       1.000

When I run the last step of the script I get this:

    txi <- tximport(files, type="salmon", tx2gene=tx2gene)

> reading in files with read_tsv
    1 2 3 4 5 6 
    Error in summarizeToGene(txi, tx2gene, varReduce, ignoreTxVersion, ignoreAfterBar,  : 

      None of the transcripts in the quantification files are present
      in the first column of tx2gene. Check to see that you are using
      the same annotation for both.

    Example IDs (file): [ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|, ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene|, ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene|, ...]

    Example IDs (tx2gene): [ENST00000456328.2, ENST00000515242.2, ENST00000518655.2, ...]

      This can sometimes (not always) be fixed using 'ignoreTxVersion' or 'ignoreAfterBar'.

I know that other people have run into this problem, but I couldn't find a solution for my case. Do you have any suggestions about what I should change?

I also have another question: why is file.csv needed? In the end it only contains the same gene IDs that are already in my quant.sf file.

Thank you

salmon tximport • 2.0k views

ADD COMMENT • link updated 5.7 years ago by Michael Love 43k • written 5.7 years ago by Merlin ▴ 10

score 2 · Accepted Answer · 2019-04-02

Michael Love 43k

@mikelove

Last seen 2 days ago

United States

Take a closer look at the message that is printed, it has some useful information for you.

ADD COMMENT • link 5.7 years ago Michael Love 43k

I'm not sure whether read_tsv is the problem, since I don't have a tsv file, or whether something more is required that relates to the summarizeToGene function.

It says this as well:

 None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.

But I used the same annotation...

Reading here: "We can avoid gene-level summarization by setting txOut=TRUE, giving the original transcript level estimates as a list of matrices."

I changed my command to

txi.salmon <- tximport(files, type="salmon", tx2gene=tx2gene, txOut=TRUE)

and I no longer get an error, but I don't know whether the output I get is correct to take into DESeq2.

Can you tell me, please?
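For readers wondering what the tx2gene table is for, here is a minimal base R sketch of gene-level summarization. The counts are made-up numbers, the IDs are taken from the tx2gene table in the question, and summing counts per gene is only a simplified stand-in for what tximport does (it also handles abundances and lengths). As far as I understand, with txOut=TRUE this step is skipped, so the result stays at transcript level and the ID mismatch is never checked.

```r
# Toy transcript-level counts for two samples (made-up numbers, for illustration only).
tx_counts <- matrix(
  c(10, 0, 5, 3,    # sample1
    20, 1, 0, 7),   # sample2
  ncol = 2,
  dimnames = list(
    c("ENST00000456328.2", "ENST00000515242.2",
      "ENST00000473358.1", "ENST00000469289.1"),
    c("sample1", "sample2")
  )
)

# Transcript-to-gene mapping, using IDs shown in the question.
tx2gene <- data.frame(
  TXNAME = c("ENST00000456328.2", "ENST00000515242.2",
             "ENST00000473358.1", "ENST00000469289.1"),
  GENEID = c("ENSG00000223972.4", "ENSG00000223972.4",
             "ENSG00000243485.2", "ENSG00000243485.2"),
  stringsAsFactors = FALSE
)

# Look up the gene for each transcript row, then sum the rows gene by gene.
gene_ids <- tx2gene$GENEID[match(rownames(tx_counts), tx2gene$TXNAME)]
gene_counts <- rowsum(tx_counts, group = gene_ids)

print(gene_counts)
```

The lookup with match() only works when the row names and the first column of tx2gene use identical IDs, which is why the mapping file and the quantification files have to agree.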

Thank you

ADD REPLY • link 5.7 years ago Merlin ▴ 10

hi Merlin,

Over the past couple of interactions, I feel like you're not taking the time to double check your work and read relevant messages.

It says above very clearly that the transcript IDs in the file look like "ENST00000456328.2|..." while the transcript IDs in the tx2gene table look like "ENST00000456328.2".

The difference is that there are a bunch of extra characters in the quantification files. The IDs need to be the same for the matching of transcripts to genes to work.

Furthermore, we have built a solution for this already, to "ignore after bar", by setting ignoreAfterBar=TRUE.

And the message that the software prints to the console even goes on to tell you that you should try this solution and that it may solve your problem.
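To illustrate the mismatch, here is a small base R sketch using two IDs copied from the quant.sf output above and a shortened tx2gene table. The sub() call is an assumption about how "ignore after bar" behaves (keep only the text before the first "|"); it is meant to mimic the idea, not to reproduce the tximport internals.

```r
# Two transcript names copied from the quant.sf output in the question.
quant_ids <- c(
  "ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|",
  "ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene|"
)

# Shortened tx2gene table with IDs from the question.
tx2gene <- data.frame(
  TXNAME = c("ENST00000456328.2", "ENST00000515242.2", "ENST00000450305.2"),
  GENEID = c("ENSG00000223972.4", "ENSG00000223972.4", "ENSG00000223972.4"),
  stringsAsFactors = FALSE
)

# The raw names carry extra fields after the bar, so none of them match.
raw_match <- quant_ids %in% tx2gene$TXNAME

# Drop everything from the first "|" onward, then match again.
trimmed_ids <- sub("\\|.*", "", quant_ids)
trimmed_match <- trimmed_ids %in% tx2gene$TXNAME

print(raw_match)
print(trimmed_match)

# With the real files, the call from this thread becomes:
# txi <- tximport(files, type = "salmon", tx2gene = tx2gene, ignoreAfterBar = TRUE)
```

Running the same %in% check on your own quant.sf names and tx2gene table is a quick way to confirm the IDs line up before calling tximport.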

Please take the time to try to solve these problems on your end before immediately posting for further help from maintainers that are already busy.

ADD REPLY • link 5.7 years ago Michael Love 43k

Thank you for your answer, Michael. Yes, I have been checking my work for at least three days, and I had also tried the two options suggested in the output, but it didn’t work because I didn’t write the complete argument with =TRUE. I’m slowly learning everything.

I’m sorry for taking your time. If you consider this a low-level question, please don’t answer; that’s my level.

In the end it works. I appreciate it.

Thank you

ADD REPLY • link 5.7 years ago Merlin ▴ 10