Correct version of EnsDb.Hsapiens for ensembl transcriptome
1
0
Entering edit mode
emmak ▴ 20
@emmak-14917
Last seen 6.8 years ago

I downloaded the transcriptome:

 ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz , to quantify read counts using Salmon.

I thought it is equivalent to EnsDb.Hsapiens.v86, and created a data frame to map transcript id to gene id. 

txdf <- transcripts(EnsDb.Hsapiens.v86, return.type="DataFrame")

tx2gene <- as.data.frame(txdf[,c("tx_id", "gene_id")])

When I imported the data: txi <- tximport(files, type="salmon", tx2gene=tx2gene) #without using ignoreTxVersion = TRUE

I got the error: 

  Error in summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance) :   
  None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.

My question is how/where I can obtain the correct version of  EnsDb.Hsapiens for the transcriptome (ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz) ? 

 

Below are the first few lines of a file and tx2gene

> head(tx2gene)
            tx_id         gene_id
1 ENST00000387314     ENSG00000210049
2 ENST00000389680     ENSG00000211459
3 ENST00000387342     ENSG00000210077
4 ENST00000387347     ENSG00000210082
5 ENST00000612848     ENSG00000276345
6 ENST00000386347     ENSG00000209082

> readLines(files[1], 10)

Name    Length    EffectiveLength    TPM    NumReads
ENST00000448914.1    13    5.000    0.000000    0.000000
ENST00000631435.1    12    5.000    0.000000    0.000000
ENST00000434970.2    9    4.000    0.000000    0.000000
ENST00000415118.1    8    4.000    0.000000    0.000000
ENST00000633010.1    16    6.000    0.000000    0.000000
ENST00000632968.1    17    6.000    0.000000    0.000000
ENST00000603693.1    19    7.000    0.000000    0.000000
ENST00000452198.1    18    7.000    0.000000    0.000000
ENST00000632609.1    31    10.000    0.000000    0.000000

 

 

Any help would be greatly appreciated,

Thank you,

 

 

ensembldb ensembl release rnaseq salmon tximport • 3.9k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 1 day ago
United States

Can you show head(tx2gene) and also show us the first few lines of one of the quant files:

readLines(files[1], 10)
ADD COMMENT
0
Entering edit mode

Thank you, Mike. I just added additional information. It ran if I used the ignoreTxVersion. But I worry that I may miss some genes because of the mismatch.

ADD REPLY
2
Entering edit mode

You don't have a choice because your tx2gene doesn't contain any version numbers. I don't think there should be duplicate transcripts in the cDNA FASTA with the same ID but different version number.

ADD REPLY
1
Entering edit mode

I agree with Mike, you're using the same Ensembl release, so you can expect that, even if the version number on the tx ID is not present in the EnsDb, they refer to the same transcripts. Ensembl is quite strict and consistent with their IDs.
 

ADD REPLY
0
Entering edit mode

Thank you, Johannes!

ADD REPLY
0
Entering edit mode

Thank Mike, I will go with ignoreTxVersion. 

I tried to download hg38 transcriptome from ucsc (for salmon) but I am not sure which is correct one (please advise ). So, I tried mrna and refMrna (from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/) but I got the above error for both. It looks like they have different IDs from tx2gene. 

Then, I tried the piece of code in your doc (http://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html) and got the same error. I don't know if I did something wrong. Do you have any suggestion ?

dir <- system.file("extdata", package = "tximportData")
list.files(dir)
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
samples
files <- file.path(dir, "salmon", samples$run, "quant.sf")
names(files) <- paste0("sample", 1:6)
all(file.exists(files))


txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
k <- keys(txdb, keytype = "GENEID")
df <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")
tx2gene <- df[, 2:1]  # tx ID, then gene ID
head(tx2gene)

txi <- tximport(files, type = "salmon", tx2gene = tx2gene)

 

> readLines(files[1], 10)

Name    Length    EffectiveLength    TPM    NumReads
NR_103451    865    706.217    0.048255    1.000000
NM_001243523    577    182.000    0.000000    0.000000
NR_038931    2432    2416.429    0.072372    5.131725
NR_110396    2909    2445.053    1.225229    87.906687
NR_038401    2146    1696.583    0.140607    7.000000
NR_027354    1104    886.567    0.079356    2.064459
NR_027311    2290    2150.480    0.000000    0.000000
NM_001257177    2344    1695.910    0.120568    6.000000

 

> head(tx2gene)
      TXNAME GENEID
1 uc002qsd.4      1
2 uc002qsf.2      1
3 uc003wyw.1     10
4 uc002xmj.3    100
5 uc010xbn.1   1000
6 uc002kwg.2   1000

ADD REPLY
1
Entering edit mode

I don't have recommendations for which transcriptome to use. The important thing is to use the FASTA and GTF from the same source and version, and keep track of this information so you can reproduce the analysis later. You can easily make a tx2gene table directly from a GTF file.

ADD REPLY
0
Entering edit mode

Thank Mike,

how about the error ? Do you see anything wrong ?

 

ADD REPLY
1
Entering edit mode

Those transcript names are clearly not the same, so how can the software match on them. I also just recommended that you need to pair the FASTA and GTF from the same source, not try to match post hoc and see if any errors pop up.

You need to do a little more work on your end in between posting questions for the developers. I try to answer questions as soon as possible, but i'm limited if I'm getting pinged too frequently.

ADD REPLY

Login before adding your answer.

Traffic: 676 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6