Cross-posted on Biostars: https://www.biostars.org/p/187335/
Sorry for horrible formatting. I am not used to the markdown on this site.
I'm quite new to RNA-sequencing and am playing around with data to get a handle on it. I have quantified with `Kallisto` and am using `tximport` to summarize transcript counts for differential gene expression analysis.
I am running into a problem associating gene ID's with my transcripts for the summarization portion. I believe that the likely cause is the actual TxDb library I am using and that it may be different from the transcriptome file I used, but I am not sure and my attempts at solving this haven't been successful.
I am working with human samples. I quantified my transcripts using [this transcriptome file for homo sapiens][1]. I have 6 samples, 3 WT replicates, and 3 KO replicates.
1. I created a vector pointing to my kallisto files as detailed in the [tximport manual.][2]
`files <- file.path(dir, "kallisto", samples$run, "abundance.tsv")`
2. I created a data.frame from a TxDb object to construct the tx2gene table.
`library(TxDb.Hsapiens.UCSC.hg38.knownGene)`
`txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene`
`k <- keys(txdb, keytype = "GENEID")`
`df <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")`
`tx2gene <- df[, 2:1] # tx ID, then gene ID`
But `head(tx2gene)` produces:
TXNAME GENEID
1 uc002qsd.4 1
2 uc002qsf.2 1
3 uc003wyw.1 10
4 uc002xmj.3 100
5 uc010xbn.1 1000
6 uc002kwg.2 1000
This obviously isn't right.
3. Using tximport's `tximport` function.
`library(tximport)` `library(readr)` `txi <- tximport(files, type = "kallisto", tx2gene = tx2gene, reader = read_tsv)` `names(txi)`
Does the following:
> txi $abundance
>
sample 1 sample 2 sample 3 sample 4 sample 5 sample 6
>
> $counts
>
sample 1 sample 2 sample 3 sample 4 sample 5 sample 6
>
> $length
>
sample 1 sample 2 sample 3 sample 4 sample 5 sample 6
>
> $countsFromAbundance
>
> [1] "no"
and `head(txi$counts)`:
> head(txi$counts)
>
> sample 1 sample 2 sample 3 sample 4 sample 5 sample 6
I'm not completely sure what i'm doing incorrectly. I'll give it another shot after lunch, it might just be the frustration at this point but any help is appreciated.
[1]: http://bio.math.berkeley.edu/kallisto/transcriptomes/
[2]: https://github.com/mikelove/tximport/blob/master/vignettes/tximport.md
Hi, I don't wnat to open a new discussion so maybe here I can get help.
When you are working with no identified transcripts from a de novo assembly (Trinity) how do you cover this step? I mean, I don't have a data source for this tx2gene requirement.
Thank you.
I moved this post from an 'Answer' to a 'Comment'
You need to come up with a way to group transcripts into genes if you want to summarize your transcripts into genes.
I believe that new software from Alicia Oshlack's group can help with this:
https://github.com/Oshlack/Lace/wiki
https://github.com/Oshlack/Lace/wiki/Algorithm
It groups transcripts by pairwise alignments.
Thank you M. Love and one question, what about to use the output file with the suffix ".genes" produced by Trinity quantification pipeline? for example when I run :
It generates abundance.tsv and abundance.tsv.genes. Could I continue from this ".genes" > tx2gene > DESeq2??
Thank you for your att.
You should be able to, as long as you can specify the names of the columns in which to find the appropriate information. For example, when you tell tximport, type="salmon" or "sailfish" etc, it just populates the arguments which specify the column names. For gene-level quantification, you would follow the example of RSEM, which inside tximport populates the following arguments for the user:
Hello Mr Love, thank you for your suggestions to my last questions. I come again with new doubts.
I'm doing this DGE with a de novo assembly so I tried to apply your suggestion (in this case I have used EvidentialGene to cluster my transcript to reduce the redundancy). Finally I could manage the construction of this tx2genes file just blasting my transcriptome and generating this 2 columns file with: TranscriptID | blast hit. I think that's correct for those genes identified but I have a lot of transcript unidentified and now they're out of the DGE analysis because they can't be summarize to gene level.
What do you think about this approach? I have this doubt because report differences about "contigs" deferentially expressed when I don't know what they are or if they come from the same gene or different genes seems a bad choice. In this case my actual use of tximport>DESeq2 could be fine?
Thanks a lot,
(please excuse me if my English is not good enough or polite I try my best...)
Pablo GF
hi Pablo,
I don't follow the exact details but if you can come up with a reasonable clustering of de novo transcripts into likely genes, the rest of the tximport => DESeq2 pipeline should give you useful results.
I think Corset from Alicia Oshlack's group is another software you might take a look at:
https://github.com/Oshlack/Corset/wiki