Question

Correct version of EnsDb.Hsapiens for ensembl transcriptome

0

Entering edit mode

emmak ▴ 20

@emmak-14917

Last seen 7.0 years ago

I downloaded the transcriptome:

ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz , to quantify read counts using Salmon.

I thought it is equivalent to EnsDb.Hsapiens.v86, and created a data frame to map transcript id to gene id.

txdf <- transcripts(EnsDb.Hsapiens.v86, return.type="DataFrame")

tx2gene <- as.data.frame(txdf[,c("tx_id", "gene_id")])

When I imported the data: txi <- tximport(files, type="salmon", tx2gene=tx2gene) #without using ignoreTxVersion = TRUE

I got the error:

Error in summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance) :
None of the transcripts in the quantification files are present
in the first column of tx2gene. Check to see that you are using
the same annotation for both.

My question is how/where I can obtain the correct version of EnsDb.Hsapiens for the transcriptome (ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz) ?

Below are the first few lines of a file and tx2gene

> head(tx2gene)
tx_id gene_id
1 ENST00000387314 ENSG00000210049
2 ENST00000389680 ENSG00000211459
3 ENST00000387342 ENSG00000210077
4 ENST00000387347 ENSG00000210082
5 ENST00000612848 ENSG00000276345
6 ENST00000386347 ENSG00000209082

> readLines(files[1], 10)

Name   Length   EffectiveLength   TPM   NumReads
ENST00000448914.1   13   5.000   0.000000   0.000000
ENST00000631435.1   12   5.000   0.000000   0.000000
ENST00000434970.2   9   4.000   0.000000   0.000000
ENST00000415118.1   8   4.000   0.000000   0.000000
ENST00000633010.1   16   6.000   0.000000   0.000000
ENST00000632968.1   17   6.000   0.000000   0.000000
ENST00000603693.1   19   7.000   0.000000   0.000000
ENST00000452198.1   18   7.000   0.000000   0.000000
ENST00000632609.1   31   10.000   0.000000   0.000000

Any help would be greatly appreciated,

Thank you,

ensembldb ensembl release rnaseq salmon tximport • 4.0k views

ADD COMMENT • link 7.2 years ago emmak ▴ 20

score 0 · Answer 1 · 2018-02-06

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 5 days ago

United States

Can you show head(tx2gene) and also show us the first few lines of one of the quant files:

readLines(files[1], 10)

ADD COMMENT • link 7.2 years ago Michael Love 43k

0

Entering edit mode

Thank you, Mike. I just added additional information. It ran if I used the ignoreTxVersion. But I worry that I may miss some genes because of the mismatch.

ADD REPLY • link 7.2 years ago emmak ▴ 20

2

Entering edit mode

You don't have a choice because your tx2gene doesn't contain any version numbers. I don't think there should be duplicate transcripts in the cDNA FASTA with the same ID but different version number.

ADD REPLY • link 7.2 years ago Michael Love 43k

1

Entering edit mode

I agree with Mike, you're using the same Ensembl release, so you can expect that, even if the version number on the tx ID is not present in the EnsDb, they refer to the same transcripts. Ensembl is quite strict and consistent with their IDs.

ADD REPLY • link 7.2 years ago Johannes Rainer ★ 2.1k

0

Entering edit mode

Thank you, Johannes!

ADD REPLY • link 7.2 years ago emmak ▴ 20

0

Entering edit mode

Thank Mike, I will go with ignoreTxVersion.

I tried to download hg38 transcriptome from ucsc (for salmon) but I am not sure which is correct one (please advise ). So, I tried mrna and refMrna (from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/) but I got the above error for both. It looks like they have different IDs from tx2gene.

Then, I tried the piece of code in your doc (http://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html) and got the same error. I don't know if I did something wrong. Do you have any suggestion ?

dir <- system.file("extdata", package = "tximportData") list.files(dir) samples <- read.table(file.path(dir, "samples.txt"), header = TRUE) samples files <- file.path(dir, "salmon", samples$run, "quant.sf") names(files) <- paste0("sample", 1:6) all(file.exists(files))

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene k <- keys(txdb, keytype = "GENEID") df <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME") tx2gene <- df[, 2:1] # tx ID, then gene ID head(tx2gene)

txi <- tximport(files, type = "salmon", tx2gene = tx2gene)

> readLines(files[1], 10)

Name   Length   EffectiveLength   TPM   NumReads
NR_103451   865   706.217   0.048255   1.000000
NM_001243523   577   182.000   0.000000   0.000000
NR_038931   2432   2416.429   0.072372   5.131725
NR_110396   2909   2445.053   1.225229   87.906687
NR_038401   2146   1696.583   0.140607   7.000000
NR_027354   1104   886.567   0.079356   2.064459
NR_027311   2290   2150.480   0.000000   0.000000
NM_001257177   2344   1695.910   0.120568   6.000000

> head(tx2gene)
TXNAME GENEID
1 uc002qsd.4 1
2 uc002qsf.2 1
3 uc003wyw.1 10
4 uc002xmj.3 100
5 uc010xbn.1 1000
6 uc002kwg.2 1000

ADD REPLY • link 7.2 years ago emmak ▴ 20

1

Entering edit mode

I don't have recommendations for which transcriptome to use. The important thing is to use the FASTA and GTF from the same source and version, and keep track of this information so you can reproduce the analysis later. You can easily make a tx2gene table directly from a GTF file.

ADD REPLY • link 7.2 years ago Michael Love 43k

0

Entering edit mode

Thank Mike,

how about the error ? Do you see anything wrong ?

ADD REPLY • link 7.2 years ago emmak ▴ 20

1

Entering edit mode

Those transcript names are clearly not the same, so how can the software match on them. I also just recommended that you need to pair the FASTA and GTF from the same source, not try to match post hoc and see if any errors pop up.

You need to do a little more work on your end in between posting questions for the developers. I try to answer questions as soon as possible, but i'm limited if I'm getting pinged too frequently.

ADD REPLY • link 7.2 years ago Michael Love 43k