I have 20 paired-end libraries pseudomapped to the rat transcriptome with Salmon, and I wish to create a tx2gene data frame via the ensembldb package for downstream analysis in DESeq2, as recommended here:
Note: if you are using an Ensembl transcriptome, the easiest way to create the tx2gene data.frame is to use the ensembldb packages. The annotation packages can be found by version number, and use the pattern EnsDb.Hsapiens.vXX. The transcripts function can be used with return.type="DataFrame", in order to obtain something like the df object constructed in the code chunk above. See the ensembldb package vignette for more details.
However, while biocLite("EnsDb.Hsapiens.v75") works fine, biocLite("EnsDb.Rnorvegicus.v89") returns: Warning message: package 'EnsDb.Rnorvegicus.v89' is not available (for R version 3.4.0)
Is this a case of trying to use the wrong tool, i.e. do these recommendations apply to human data but not to other species, or is it some other issue? Would BioMart help?
The warning message just means that the EnsDb.Rnorvegicus.v89 package is not available. For the rat genome, only EnsDb.Rnorvegicus.v75 and EnsDb.Rnorvegicus.v79 are available as packages that you can install with the biocLite function.
If you need more recent gene models, you can get the Ensembl rat data for Ensembl version 87 and 88 from AnnotationHub (I am currently building the ones for Ensembl 89, but that takes some time):
library(AnnotationHub)
ah <- AnnotationHub()
## To get the EnsDb for Rnorvegicus, Ensembl version 88:
edb <- query(ah, "EnsDb.Rnorvegicus.v88")[[1]]
## You can then use this edb for your queries
transcripts(edb)
GRanges object with 41078 ranges and 6 metadata columns:
seqnames ranges strand | tx_id
<Rle> <IRanges> <Rle> | <character>
ENSRNOT00000044187 1 [396700, 409676] + | ENSRNOT00000044187
ENSRNOT00000072186 1 [396700, 409676] + | ENSRNOT00000072186
ENSRNOT00000093216 1 [396840, 409750] + | ENSRNOT00000093216
... ... ... ... . ...
ENSRNOT00000085333 Y [2653008, 2654859] + | ENSRNOT00000085333
ENSRNOT00000092839 Y [3181118, 3181328] + | ENSRNOT00000092839
ENSRNOT00000086356 Y [3253610, 3254888] + | ENSRNOT00000086356
tx_biotype tx_cds_seq_start tx_cds_seq_end
<character> <integer> <integer>
ENSRNOT00000044187 processed_transcript <NA> <NA>
ENSRNOT00000072186 processed_transcript <NA> <NA>
ENSRNOT00000093216 processed_transcript <NA> <NA>
... ... ... ...
ENSRNOT00000085333 lincRNA <NA> <NA>
ENSRNOT00000092839 processed_pseudogene <NA> <NA>
ENSRNOT00000086356 lincRNA <NA> <NA>
gene_id tx_name
<character> <character>
ENSRNOT00000044187 ENSRNOG00000046319 ENSRNOT00000044187
ENSRNOT00000072186 ENSRNOG00000046319 ENSRNOT00000072186
ENSRNOT00000093216 ENSRNOG00000046319 ENSRNOT00000093216
... ... ...
ENSRNOT00000085333 ENSRNOG00000052946 ENSRNOT00000085333
ENSRNOT00000092839 ENSRNOG00000062169 ENSRNOT00000092839
ENSRNOT00000086356 ENSRNOG00000058415 ENSRNOT00000086356
-------
seqinfo: 162 sequences from Rnor_6.0 genome
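For the tx2gene table asked about in the original question, a minimal sketch follows. It assumes the `edb` object loaded from AnnotationHub above; `make_tx2gene` is a hypothetical helper name, not part of ensembldb. It relies on the `columns` and `return.type` arguments of `transcripts()` and puts the transcript ID in the first column, which is where tximport looks for it.

```r
# Hypothetical helper (not part of ensembldb): build a two-column
# transcript-to-gene table from an EnsDb object.
make_tx2gene <- function(edb) {
  # Ask only for the two ID columns and return a plain data.frame
  tx <- ensembldb::transcripts(edb,
                               columns = c("tx_id", "gene_id"),
                               return.type = "data.frame")
  # Enforce the column order tximport expects: transcript ID, then gene ID
  tx[, c("tx_id", "gene_id")]
}

# Usage, assuming `edb` was created as shown above:
# tx2gene <- make_tx2gene(edb)
# head(tx2gene)
```

The helper only wraps a single call, so the same two lines can equally be run directly on `edb` at the console.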
Hi, I have a question. The transcript names in kallisto's output look like ENSMUST00000178537.1, but there is no .1 in the tx_name reported by ensembldb. When I run tximport, I get this error:
Error in summarizeToGene(txi.kallisto, tx2gene) :
None of the transcripts in the quantification files are present
in the first column of tx2gene. Check to see that you are using
the same annotation for both.
Do you know how to solve this problem? Thanks a lot for your time.
You have to remove the transcript version number from the transcript IDs (i.e. the .1). Just be sure that the Ensembl version of the EnsDb you are using and the version that was used for kallisto match.
A fast way to remove them is e.g. top_table$tx_id <- sub("\\.[0-9]*$", "", top_table$tx_id)
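As a small illustration of the version stripping described above, the sketch below applies the same `sub()` pattern to a vector of IDs and checks which ones are then found in a tx2gene table. Only the first transcript ID comes from this thread; the other IDs, the gene names, and the helper name `strip_tx_version` are made up for the example.

```r
# Hypothetical helper: drop a trailing ".<digits>" version suffix from IDs
strip_tx_version <- function(ids) {
  sub("\\.[0-9]*$", "", ids)
}

# IDs as a quantifier might report them (first one is from the thread,
# the rest are illustrative placeholders)
quant_ids <- c("ENSMUST00000178537.1",
               "ENSMUST00000000001.4",
               "ENSMUST00000000002.12")

# Toy tx2gene table with unversioned IDs and placeholder gene names
tx2gene <- data.frame(
  tx_id   = c("ENSMUST00000178537", "ENSMUST00000000001"),
  gene_id = c("GENE_A", "GENE_B"),
  stringsAsFactors = FALSE
)

stripped <- strip_tx_version(quant_ids)

# TRUE where the stripped ID is present in the first column of tx2gene
matched <- stripped %in% tx2gene$tx_id

print(data.frame(quant_ids, stripped, matched))
```

Checking `matched` before running tximport is a quick way to see whether remaining mismatches are due to version suffixes or to different Ensembl releases.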
That makes sense. It seems that, since the pseudomapping was done against the v89 transcriptome, the same gene models should be used. I guess it is more of a theoretical than a practical question whether an earlier gene model would be appropriate. Thanks again!
I have just started a zebrafish project. Will you be creating EnsDbs for Ensembl v89 of Danio rerio as well? If not, may I ask what the best way to create one is? Thanks
I create EnsDbs for all species defined in Ensembl, including Danio rerio. Once I'm done, these EnsDbs will show up in the AnnotationHub of the Bioc devel version.
Note that the Danio rerio EnsDbs for Ensembl 87 and 88 are already in AnnotationHub.
FYI: Johannes' solution is automagically performed within the tximport function when specifying the argument ignoreTxVersion = TRUE (default = FALSE). Check the help page for ?tximport. You can ignore version numbers