VariantAnnotation function getTranscriptSeqs() gives error "object '.makeUCSCTxListFromGRangesList' not found"
2
0
Entering edit mode
@tjeerd-dijkstra-7350
Last seen 9.1 years ago
Netherlands

 

 

Dear Valerie (and other of the Bioconductor team),

Following toy example used to work with Bioconductor 3.0 but does not work with the current version (3.1):

library(VariantAnnotation)

## simulated data with two genes on a single chromosome
# GFF3 information for two genes with two exons each, one on the + and one on the - strand
gene.name <- paste0(rep("gene", 6), rep(LETTERS[1:2], each=3))
exon.name <- c("", "exon1", "exon2", "", "exon2", "exon1")
exon.rank <- as.integer(c(0, 1, 2, 0, 2, 1))
sim.gr <- GRanges(seqnames=Rle("chr1", 6),
                  ranges=IRanges(rep(c(1, 1, 11), 2), end=rep(c(15, 5, 15), 2),
                                 names=paste0(gene.name, exon.name)),
                  strand=Rle(c("+", "-"), c(3, 3)), score=rep(0, 6),
                  phase=rep(0, 6), type=rep(c("mRNA", "CDS", "CDS"), 2),
                  Parent=gene.name, exon_rank=exon.rank)
CDS.flag <- sim.gr$type == "CDS"
sim.grl <- split(sim.gr[CDS.flag], substr(names(sim.gr[CDS.flag]), 1, 5))
# simulated genome with single chromosome of length 20
sim.genome.ss <- DNAStringSet(c("AAAAACCCCCGGGGGTTTTT"))
names(sim.genome.ss) <- c("chr1")
writeXStringSet(sim.genome.ss, "Sim.Genome.fasta")
sim.genome.fa <- open(FaFile("Sim.Genome.fasta"))
# transcriptome from genome, exon_rank column is necessary for correct order
#   of exons on negative strand
sim.trans.ss <- getTranscriptSeqs(sim.grl, sim.genome.fa)

Output from sessionInfo()

R version 3.2.1 (2015-06-18)

Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] VariantAnnotation_1.14.10 Rsamtools_1.20.4          Biostrings_2.36.3         XVector_0.8.0             GenomicFeatures_1.20.1   
 [6] AnnotationDbi_1.30.1      Biobase_2.28.0            GenomicRanges_1.20.5      GenomeInfoDb_1.4.1        IRanges_2.2.7            
[11] S4Vectors_0.6.3           BiocGenerics_0.14.0       BiocInstaller_1.18.4     

loaded via a namespace (and not attached):
 [1] zlibbioc_1.14.0         GenomicAlignments_1.4.1 BiocParallel_1.2.20     BSgenome_1.36.3         tools_3.2.1            
 [6] DBI_0.3.1               lambda.r_1.1.7          futile.logger_1.4.1     rtracklayer_1.28.7      futile.options_1.0.0   
[11] bitops_1.0-6            RCurl_1.95-4.7          biomaRt_2.24.0          RSQLite_1.0.0           XML_3.98-1.3   

I get exactly the same error on my Mac running OS X Mavericks. Any suggestions as to what I am doing wrong would be nice.

Tjeerd

variantannotation • 2.5k views
ADD COMMENT
0
Entering edit mode
@valerie-obenchain-4275
Last seen 2.7 years ago
United States

Hi,

This was my fault for not adding a deprecation message to all getTranscriptSeqs() methods. This function has been replaced with GenomicFeatures::extractTranscriptSeqs(); it's deprecated in release (BioC 3.1) and defunct in devel (BioC 3.2).

I've added the message and fixed the dispatch in VariantAnnotation 1.14.11. Should be available via biocLite() by tomorrow ~ noon.

Thanks for reporting the bug.

Valerie

ADD COMMENT
0
Entering edit mode

One more thing. When I run your code with 1.14.11 I get this error from GenomicFeatures::extractTranscriptSeqs():

> sim.trans.ss <- getTranscriptSeqs(sim.grl, sim.genome.fa)
Error in .check_exon_rank(tx1) : 
  "exon_rank" inner metadata column in GRangesList object 'transcripts'
  does not contain increasing consecutive ranks for some transcripts
In addition: Warning message:
'getTranscriptSeqs' is deprecated.
Use 'extractTransciptSeqs' instead.
See help("Deprecated") 

extractTranscriptSeqs() handles exons on the negative strand - there's no need to re-order them such that the rank is decreasing for the negative strand. 

Valerie

 

ADD REPLY
0
Entering edit mode

Hi Valerie,

Thank you for your quick answer. Code now certainly runs by using GenomicFeatures::extractTranscriptSeqs() and getting rid of the exon_rank feature. However, extractTranscriptSeqs() gives a different output from gffread (from the Cufflinks package) for negative strand genes. Specifically, it seems to take the reverse complement of each exon but orders the exons same as on the plus strand. Here is the same toy example, slightly elaborated:

## simulated data with two genes on a single chromosome
library(GenomicFeatures); library(Rsamtools); library(rtracklayer)
# GFF3 information for two genes with two exons each, one on the + and one on the - strand
gene.name <- paste0(rep("gene", 6), rep(LETTERS[1:2], each=3))
exon.name <- c("", "exon1", "exon2", "", "exon2", "exon1")
sim.gr <- GRanges(seqnames=Rle("chr1", 6),
                  ranges=IRanges(rep(c(1, 1, 11), 2), end=rep(c(15, 5, 15), 2),
                                 names=paste0(gene.name, exon.name)),
                  strand=Rle(c("+", "-"), c(3, 3)), score=rep(0, 6),
                  phase=rep(0, 6), type=rep(c("mRNA", "CDS", "CDS"), 2),
                  Parent=gene.name)
# split CDS part on gene.name to make a list of two genes with two exons each
CDS.flag <- sim.gr$type == "CDS"
sim.grl <- split(sim.gr[CDS.flag], substr(names(sim.gr[CDS.flag]), 1, 5))
# simulated genome with single chromosome of length 20
sim.genome.ss <- DNAStringSet(c("TAAAACGCCCGGCGGTTTAT"))
names(sim.genome.ss) <- c("chr1")
writeXStringSet(sim.genome.ss, "Sim.Genome.fasta")
sim.genome.fa <- open(FaFile("Sim.Genome.fasta"))
# transcriptome from genome
sim.trans.ss <- extractTranscriptSeqs(sim.genome.fa, sim.grl)
writeXStringSet(sim.trans.ss, "sim.trans.biocond.fasta")

# Parent column is necessary for correct grouping of exons by gffread
export.gff3(sim.gr, "Sim.gff3")
# generate transcriptome FASTA from genome FASTA and GFF/GTF file with
# /Applications/BioInformatics/cufflinks-2.2.1/gffread \
#   -w sim.trans.gffread.fasta -g Sim.Genome.fasta Sim.gff3

with file sim.trans.bicond.fasta reading:

>geneA
TAAAAGGCGG
>geneB
TTTTACCGCC

with file sim.trans.gffread.fasta reading:

>geneA CDS=1-10
TAAAAGGCGG
>geneB CDS=1-10
CCGCCTTTTA

 

ADD REPLY
0
Entering edit mode

The problem is with the order of the exons on the "-" strand.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
exons <- exonsBy(txdb, by="tx")


The GenomicFeatures extractors always return exons (or any feature) in the order of transcription. For the "-" strand this means the first exon (ie, rank 1) is the most 3' wrt the "+" strand and the last is the most 5' wrt the "+" strand.

> exons[4074]

GRangesList object of length 1:
$4074 
GRanges object with 4 ranges and 3 metadata columns:
      seqnames         ranges strand |   exon_id   exon_name exon_rank
         <Rle>      <IRanges>  <Rle> | <integer> <character> <integer>
  [1]     chr1 [16607, 16765]      - |     13970        <NA>         1
  [2]     chr1 [15796, 15942]      - |     13968        <NA>         2
  [3]     chr1 [14970, 15038]      - |     13966        <NA>         3
  [4]     chr1 [14362, 14829]      - |     13964        <NA>         4


Changing the order of the "-" strand exons in sim.grl gives the same answer you see with Cufflinks.

> sim.grl2 = GRangesList(sim.grl[[1]], rev(sim.grl[[2]]))
> sim.trans.ss <- extractTranscriptSeqs(sim.genome.fa, sim.grl2)
> sim.trans.ss
  A DNAStringSet instance of length 2
    width seq
[1]    10 TAAAAGGCGG
[2]    10 CCGCCTTTTA


Valerie

 

ADD REPLY
0
Entering edit mode

 

Hi Valerie,

This works indeed, thank you. It is actually in the examples of the manual for extractTranscriptSeqs(), in the TOY EXAMPLE as 

extractTranscriptSeqs(x, IRangesList(tx1=tx1, tx2=rev(tx2)), strand="-")

However, the fact that tx2 needs to be reversed is not commented upon, could be stressed more IMHO. More generally, it would be nice to add an example based on GFF annotation files. For people working outside of human and standard model organisms (mouse, drosophila, yeast etc) GFF file are often the standard annotation files. It would be nice if there were some more relevant examples in the Bioconductor documentation.

Take care from Tjeerd

ADD REPLY
0
Entering edit mode

Hi Tjeerd,

Thanks for your feedback. I'll add something to the man page for extractTranscriptSeqs() about this.

Note that when working with GFF/GTF files, the recommended path is to first load the annotations into a TxDb object with makeTxDbFromGFF(), and then to extract the exons grouped by transcript with exonsBy()exonsBy() takes care of returning the exons sorted by ascending rank for each transcript (that is, ordered from 5' to 3' in transcript space), whatever the strand is, so the user doesn't need to worry about getting the exons in the right order before calling extractTranscriptSeqs().

Hope this helps,

H,

ADD REPLY
0
Entering edit mode

Done in GenomicFeatures 1.20.3 (release) and 1.21.17 (devel).

Cheers,

H.

ADD REPLY
0
Entering edit mode

Hi Herve,

This certainly helped, thank you very much. One more qualm: my toy example is intentionally structured as a GeneDB GFF3 file (I use  Pknowlesi.noseq.gff3) . These files have CDS feature types but *no* exon features. Such a GFF3 file leads to the following error in exonsBy (or cdsBy):

Error in .make_splicings(exons, cds, stop_codons) :   some CDS cannot be mapped to an exon

When I change "CDS" to "exon" all is fine. Of course I can run a global find and replace to replace all occurrences of "CDS" with "exon" in the file but this seems rather inelegant (especially since the gffread utility works fine with the original file).

More generally, I actually read the GenomicFeatures vignette a while ago but gave up on this route when I saw that I needed to compile my genome into a BSGenome package. This is a misreading from my end but none of the worked-out examples in the vignette were relevant for my use case. While excellent, your documentation is a bit skewed to the human/model organism researchers who are probably in the majority. It would be nice to have an example in the manual where one starts with just a genome FASTA file and a GFF annotation file. I study a monkey malaria (Plasmodium knowlesi) and a monkey host (Macaca mulatta) and for both organisms information is distributed as genome FASTA with GFF3/GTF annotation file. I imagine this use case would apply to many more investigators.

ADD REPLY
0
Entering edit mode

Hi Tjeerd,

makeTxDbFromGFF() supports some kind of GFF files where CDS are direct children of genes. In that case the exons and transcripts are inferred from the CDS. The GenomicFeatures package actually contains such a file (used in the unit tests):

filepath <- system.file("extdata", "GFF3_files", "NC_011025.gff", package="GenomicFeatures")
txdb <- makeTxDbFromGFF(filepath)

Your file doesn't seem to fall into this category though but I'd be happy to look at it and try to make makeTxDbFromGFF() work on it. Can you please provide the link to the Pknowlesi.noseq.gff3 file? Also it would help if you could show the exact commands you used that caused this error.

Thanks,

H.

ADD REPLY
0
Entering edit mode

Hi Herve,

Here is some code that compares the transcriptome from bioconductor with the one from gffread. For Mmul this works fine but I am struggling with Pkno. An (indirect) link to the Pknowlesi.noseq.gff is from ftp.sanger.ac.uk/pub/project/pathogens/gff3/CURRENT

# source("TransFromGFF.R")
# extracting transcripts from a genome FASTA file using GFF annotation files
#
# Version 0.4, 23 August 2015 by Tjeerd Dijkstra
# Tested with R 3.2.2 and Bioconductor 3.1 under OS X 10.9.5 on an i7-2635QM@2GHz

library(GenomicFeatures)

## transcriptome from genome using TxDb for Macaca mulatta
# Mmul genome and annotation from www.unmc.edu/rhesusgenechip/index.htm#NewRhesusGenome
Mm.txdb <- makeTxDbFromGFF("~/Expres/RUMPHI/Mmul-Genome/Mmul_Annotation_v7.6.8.gtf")
# Mm.tx.grl <- cdsBy(Mm.txdb, by = "tx", use.names = TRUE)
Mm.tx.grl <- exonsBy(Mm.txdb, by = "tx", use.names = TRUE)
Mm.genome.fa <- open(FaFile("~/Expres/RUMPHI/Mmul-Genome/Mmul_Genome_v7.fasta"))
Mm.trans.biocond <- extractTranscriptSeqs(Mm.genome.fa, Mm.tx.grl)
writeXStringSet(Mm.trans.biocond, "Mm.trans.biocond.fasta")

# generate transcriptome FASTA from genome FASTA and GTF file with gffread
system2("/Applications/BioInformatics/cufflinks-2.2.1/gffread",
        args = paste("-w Mm.trans.gffread.fasta",
                     "-g ~/Expres/RUMPHI/Mmul-Genome/Mmul_Genome_v7.fasta",
                     "~/Expres/RUMPHI/Mmul-Genome/Mmul_Annotation_v7.6.8.gtf"))
# compare bioconductor transcriptome with gffread one
Mm.trans.gffread <- readDNAStringSet("Mm.trans.gffread.fasta")
# gffread adds stuff to the name of transcripts which I cut off by some grep magic
match.index <- match(names(Mm.trans.biocond), gsub("(\\ ).*", "", names(Mm.trans.gffread)))
# names match except for CNTFR_transcript_01 that exonsBy warns about
cat(sprintf("Number of names of bioconductor and gffread transcriptome that do not match %d\n",
            sumis.na(match.index))))
nchar.biocond <- nchar(Mm.trans.biocond); nchar.gffread <- nchar(Mm.trans.gffread[match.index])
# all lengths are the same ... great
length.differ.flag <- (nchar.biocond != nchar.gffread)
cat(sprintf("Number of lengths of bioconductor and gffread transcriptome that do not match %d\n",
            sum(length.differ.flag)))

## transcriptome from genome using TxDb for Plasmodium knowlesi
# data downloaded by following the ftp link on www.genedb.org/Homepage/Pknowlesi
# Error in .make_splicings(exons, cds, stop_codons): some CDS cannot be mapped to an exon
Pk.txdb <- makeTxDbFromGFF("~/Expres/RUMPHI/Pknow-GeneDB/Pknowlesi.noseq.gff3")

# # following attempt to filter out non-protein coding genes also fails
# Pk.gene.cds.gr <- Pk.gr[Pk.gr$type == "gene" | Pk.gr$type == "mRNA"| Pk.gr$type == "CDS"]
# Pk.txdb <- makeTxDbFromGRangesPk.gene.cds.gr)
Pk.tx.grl <- cdsBy(Pk.txdb, by = "tx", use.names = TRUE)
Pk.genome.fa <- open(FaFile("~/Expres/RUMPHI/Pknow-GeneDB/Pknowlesi.genome.fasta"))
Pk.trans.biocond <- extractTranscriptSeqs(Pk.genome.fa, Pk.tx.grl)
writeXStringSet(Pk.trans.biocond, "Pk.trans.biocond.fasta")

# generate transcriptome FASTA from genome FASTA and GFF/GTF file with gffread
system2("/Applications/BioInformatics/cufflinks-2.2.1/gffread",
        args = paste("-w Pk.trans.gffread.fasta",
                     "-g ~/Expres/RUMPHI/Pknow-GeneDB/Pknowlesi.genome.fasta",
                     "~/Expres/RUMPHI/Pknow-GeneDB/Pknowlesi.noseq.gff3"))
Pk.trans.gffread <- readDNAStringSet("Pk.trans.gffread.fasta")
# gffread adds stuff to the name of transcripts which I cut off by some grep magic
match.index <- match(names(Pk.trans.biocond), gsub("(\\ ).*", "", names(Pk.trans.gffread)))
# names match except for CNTFR_transcript_01 that exonsBy warns about
cat(sprintf("Number of names of bioconductor and gffread transcriptome that do not match %d\n",
            sumis.na(match.index))))
nchar.biocond <- nchar(Pk.trans.biocond); nchar.gffread <- nchar(Pk.trans.gffread[match.index])
# all lengths are the same ... great
length.differ.flag <- (nchar.biocond != nchar.gffread)
cat(sprintf("Number of lengths of bioconductor and gffread transcriptome that do not match %d\n",
            sum(length.differ.flag)))

Thank you once more for looking into this! Tjeerd

ADD REPLY
0
Entering edit mode

Hi Tjeerd,

Thanks for providing the file. I'll look at it and will get back here once I have a solution.

H.

ADD REPLY
0
Entering edit mode

Hi Herve,

Not in a hurry but it would be nice to have a solution for my struggles with the GeneDB GFF3 files.

Take care,

Tjeerd

ADD REPLY
0
Entering edit mode

I'm on it. Thanks for your patience.

H.

ADD REPLY
0
Entering edit mode

Hi Herve,

Did you manage to get to it?

Cheers from Tjeerd

ADD REPLY
0
Entering edit mode

Not yet sorry. Other urgent matters keep showing up. I won't be able to tackle this before next week. Thanks again for your patience.

H.

ADD REPLY
0
Entering edit mode

Hi Herve,

Now that BioConductor 3.2 is out, would you have time for my (little) issue?

Take care,

Tjeerd

ADD REPLY
0
Entering edit mode

Hi Tjeerd,

I finally had a chance to add support for the GFF files from GeneDB to makeTxDbFromGFF(). This is in GenomicFeatures 1.22.1, which will become available in BioC 3.2 (the current release) over the weekend.

Sorry that it took so long and thanks for your patience. Please don't hesitate to let us know if you run into any further problem.

Thanks,

H.

ADD REPLY
0
Entering edit mode

 

Hi Herve,

The new makeTxDbFromGFF() does fine with my GeneDB GFF file and gives the same transcripts as the gffread utility. Thank you for your efforts, this makes manipulation of genomic features much easier for me.

Take care,

Tjeerd

ADD REPLY
0
Entering edit mode

This is good news. Thanks for the feedback! I really appreciate the fact that you took the time to compare with what gffread gives you.

Cheers,

H.

ADD REPLY

Login before adding your answer.

Traffic: 540 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6