Question

Using 'tximport' library for downstream DGE after quantifying with Kallisto

1

Entering edit mode

cguzman.bioinformatics ▴ 10

@cguzmanbioinformatics-10132

Last seen 8.6 years ago

Cross-posted on Biostars: https://www.biostars.org/p/187335/

Sorry for horrible formatting. I am not used to the markdown on this site.

I'm quite new to RNA-sequencing and am playing around with data to get a handle on it. I have quantified with `Kallisto` and am using `tximport` to summarize transcript counts for differential gene expression analysis.

I am running into a problem associating gene ID's with my transcripts for the summarization portion. I believe that the likely cause is the actual TxDb library I am using and that it may be different from the transcriptome file I used, but I am not sure and my attempts at solving this haven't been successful.

I am working with human samples. I quantified my transcripts using [this transcriptome file for homo sapiens][1]. I have 6 samples, 3 WT replicates, and 3 KO replicates.

1. I created a vector pointing to my kallisto files as detailed in the [tximport manual.][2]

    `files <- file.path(dir, "kallisto", samples$run, "abundance.tsv")`

2. I created a data.frame from a TxDb object to construct the tx2gene table.

    `library(TxDb.Hsapiens.UCSC.hg38.knownGene)`

    `txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene`

    `k <- keys(txdb, keytype = "GENEID")`

    `df <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")`

    `tx2gene <- df[, 2:1]  # tx ID, then gene ID`

But `head(tx2gene)` produces:

TXNAME GENEID
1 uc002qsd.4 1
2 uc002qsf.2 1
3 uc003wyw.1 10
4 uc002xmj.3 100
5 uc010xbn.1 1000
6 uc002kwg.2 1000

This obviously isn't right.

3. Using tximport's `tximport` function.

    `library(tximport)`

    `library(readr)`

    `txi <- tximport(files, type = "kallisto", tx2gene = tx2gene, reader = read_tsv)`

    `names(txi)`

Does the following:

> txi $abundance
>
sample 1 sample 2 sample 3 sample 4 sample 5 sample 6
>
> $counts
>
sample 1 sample 2 sample 3 sample 4 sample 5 sample 6
>
> $length
>
sample 1 sample 2 sample 3 sample 4 sample 5 sample 6
>
> $countsFromAbundance
>
> [1] "no"

and `head(txi$counts)`:

> head(txi$counts)
>
> sample 1 sample 2 sample 3 sample 4 sample 5 sample 6

I'm not completely sure what i'm doing incorrectly. I'll give it another shot after lunch, it might just be the frustration at this point but any help is appreciated.

[1]: http://bio.math.berkeley.edu/kallisto/transcriptomes/
[2]: https://github.com/mikelove/tximport/blob/master/vignettes/tximport.md

tximport Kallisto rna-seq • 5.9k views

ADD COMMENT • link updated 7.8 years ago by pablo_garcia05 ▴ 20 • written 8.6 years ago by cguzman.bioinformatics ▴ 10

0

Entering edit mode

Hi, I don't wnat to open a new discussion so maybe here I can get help.

When you are working with no identified transcripts from a de novo assembly (Trinity) how do you cover this step? I mean, I don't have a data source for this tx2gene requirement.

Thank you.

ADD REPLY • link 7.8 years ago pablo_garcia05 ▴ 20

0

Entering edit mode

I moved this post from an 'Answer' to a 'Comment'

You need to come up with a way to group transcripts into genes if you want to summarize your transcripts into genes.

I believe that new software from Alicia Oshlack's group can help with this:

https://github.com/Oshlack/Lace/wiki

https://github.com/Oshlack/Lace/wiki/Algorithm

It groups transcripts by pairwise alignments.

ADD REPLY • link 7.8 years ago Michael Love 43k

0

Entering edit mode

Thank you M. Love and one question, what about to use the output file with the suffix ".genes" produced by Trinity quantification pipeline? for example when I run :

$TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
         --left AA_1.fastq  \
         --right AA_2.fastq  \
         --transcripts Trinity.fasta  \
         --output_prefix AA \
         --est_method kallisto \
         --trinity_mode  \
         --output_dir AA_kallisto

It generates abundance.tsv and abundance.tsv.genes. Could I continue from this ".genes" > tx2gene > DESeq2??

Thank you for your att.

ADD REPLY • link 7.8 years ago pablo_garcia05 ▴ 20

0

Entering edit mode

You should be able to, as long as you can specify the names of the columns in which to find the appropriate information. For example, when you tell tximport, type="salmon" or "sailfish" etc, it just populates the arguments which specify the column names. For gene-level quantification, you would follow the example of RSEM, which inside tximport populates the following arguments for the user:

  if (type == "rsem") {
    txIn <- FALSE
    geneIdCol <- "gene_id"
    abundanceCol <- "FPKM"
    countsCol <- "expected_count"
    lengthCol <- "effective_length"
  }

ADD REPLY • link 7.8 years ago Michael Love 43k

0

Entering edit mode

Hello Mr Love, thank you for your suggestions to my last questions. I come again with new doubts.

I'm doing this DGE with a de novo assembly so I tried to apply your suggestion (in this case I have used EvidentialGene to cluster my transcript to reduce the redundancy). Finally I could manage the construction of this tx2genes file just blasting my transcriptome and generating this 2 columns file with: TranscriptID | blast hit. I think that's correct for those genes identified but I have a lot of transcript unidentified and now they're out of the DGE analysis because they can't be summarize to gene level.

What do you think about this approach? I have this doubt because report differences about "contigs" deferentially expressed when I don't know what they are or if they come from the same gene or different genes seems a bad choice. In this case my actual use of tximport>DESeq2 could be fine?

Thanks a lot,

(please excuse me if my English is not good enough or polite I try my best...)

Pablo GF

ADD REPLY • link 7.6 years ago pablo_garcia05 ▴ 20

0

Entering edit mode

hi Pablo,

I don't follow the exact details but if you can come up with a reasonable clustering of de novo transcripts into likely genes, the rest of the tximport => DESeq2 pipeline should give you useful results.

I think Corset from Alicia Oshlack's group is another software you might take a look at:

https://github.com/Oshlack/Corset/wiki

ADD REPLY • link 7.6 years ago Michael Love 43k

score 6 · Accepted Answer · 2016-04-19

6

Entering edit mode

Michael Love 43k

@mikelove

Last seen 13 hours ago

United States

Hi,

So that's a Ensembl transcriptome you used with kallisto. You can see the genome and Ensembl release info here: ...GRCh38.rel79...

Meanwhile, TxDb.Hsapiens.UCSC.hg38.knownGene is UCSC based and doesn't tell you anything about the Ensembl genes and transcripts.

Here is an annotation package which will give you access to the Ensembl Release 79 transcripts and genes:

http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html

How to use the objects in this Ensembl annotation package is shown in this "parent" package:

http://bioconductor.org/packages/release/bioc/html/ensembldb.html

I just scanned the vignette for ensembldb and found some useful code. For example, this gives you a DataFrame linking transcripts and genes:

txdf <- transcripts(EnsDb.Hsapiens.v79, return.type="DataFrame")

Then:

tx2gene <- as.data.frame(txdf[,c("tx_id","gene_id")])

ADD COMMENT • link 8.6 years ago Michael Love 43k

0

Entering edit mode

This is what I had thought was the problem, I just could not for the life of me find the TxDb of Ensembl.

Thank you!

ADD REPLY • link 8.6 years ago cguzman.bioinformatics ▴ 10

0

Entering edit mode

in v0.99.9 I've added a more informative error if no transcripts in tx2gene are present in the quantification files, and I've added some information to the vignette regarding Ensembl transcriptomes and the use of transcripts() with the ensembldb packages.

ADD REPLY • link 8.6 years ago Michael Love 43k

score 2 · Accepted Answer · 2016-04-19

The TxDb package you are using is based on UCSC's transcript IDs, not Ensembl, which is apparently the source of the transcripts you used. If you are using Ensembl-centric transcripts, then you should probably be using one of the EnsDbs that Johannes Rainer generates, for instance this one.

But what you are doing is sort of weird, using Ensembl transcript IDs, and mapping to NCBI's Gene IDs. There will always be issues mapping between the two, so you might want to use either Ensembl transcript and gene IDs, or NCBI's RefSeq and Gene IDs.

If you want to do the cross-annotation source mapping, you can use the org.Hs.eg.db package.

> d.f <- select(org.Hs.eg.db, keys(org.Hs.eg.db), "ENSEMBLTRANS")
'select()' returned 1:many mapping between keys and columns
> head(d.f[,2:1])
     ENSEMBLTRANS ENTREZID
1 ENST00000596924        1
2 ENST00000263100        1
3 ENST00000595014        1
4 ENST00000598345        1
5 ENST00000600966        1
6 ENST00000543436        2