We are using the DEXSeq package and are following the steps in the workflow vignette (“Inferring differential exon usage in RNASeq data with the DEXSeq package”).
We initially downloaded the Biconductor “Pasilla” Data Package (at http://bioconductor.org/packages/release/data/experiment/html/pasilla.html). Using the sample data files that were provided as part of this package we were able to successfully work our way through the DEXSeq workflow from Section 3 onwards.
Now however, we have gone back and are trying to use the 2 Python scripts described in Section 2 of the DEXSeq workflow, to generate the flattened “.GFF” genome reference (output of the “dexseq_prepare_annotation.py” Python script), and the “.TXT” text files (output of the “dexseq_count.py” Python script).
We downloaded the relevant FastQ files from NCBI – GEO at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE18508. We then uploaded these to the GALAXY hub, and used the TopHat aligner to produce associated BAM files. When running TopHat in GALAXY, we specified “Use a built-in genome”, and selected “Fruit Fly (Drosophila melanogaster): dm3” as the reference genome to align against.
This produced BAM files where the chromosomes are identified in the format “CHRn” (where “n” is the chromosome name). Therefore, for the input to the “dexseq_prepare_annotation.py” Python script, we need to supply a GTF file where the chromosome names are specified in the “CHRn” format. When we downloaded a GTF reference file from the Ensembl website (“http://www.ensembl.org/info/data/ftp/index.html”), the chromosomes were in the “n” format (rather than “CHRn” format). We then went to the UCSC Table Browser (at “http://genome.ucsc.edu/cgi-bin/hgTables”) and downloaded a GTF reference file for “Drosophila Melanogaster BDGP R5/dm3”. This file had the chromosomes in the correct “CHRn” format. However when you look at the contents of this GTF file, the first few lines look like this:
chr4 dm3_ensGene exon 90287 90475 0.000000 + . gene_id "FBtr0309803"; transcript_id "FBtr0309803";
chr4 dm3_ensGene start_codon 93056 93058 0.000000 + . gene_id "FBtr0309803"; transcript_id "FBtr0309803";
chr4 dm3_ensGene CDS 93056 93220 0.000000 + 0 gene_id "FBtr0309803"; transcript_id "FBtr0309803";
chr4 dm3_ensGene exon 92947 93220 0.000000 + . gene_id "FBtr0309803"; transcript_id "FBtr0309803";
The “gene_id” fields are all in the form “FBtr0309803” when we would have expected them to be in the form “FBgn0309803”, (i.e. FBgn___ rather than FBtr___).
Thus when we tried to use this GTF file in the Python scripts, the results did not come out as expected. We have searched high and low for a GTF file where the “gene_id”s are in the FBgn___ format. We have found a number of sources for GTF files, but it appears that all the ones (e.g. from the Ensembl website) that have the correct “gene_id”, have the chromosome names in the “n’ format (rather than the “CHRn” format). Conversely, all of the GTF files we found that had the chromosome names in the needed ‘CHRn” format, had the “gene_id” form of FBtr___.
Can anyone please advise where we can obtain a GTF reference file in the correct format to feed in to the “dexseq_prepare_annotation.py” Python script for a BAM file that has chromosomes in the “CHRn” format.
Thanks James for your reply. I have followed your detailed instructions above and from looking at the resultant object, it looks like this will solve our problem.
However, when I get to the last step within "R" (> export(z, "tmp.gtf", "gtf")), I get the message: Error: could not find function "export". Is there another library/package (other than "AnnotationHub") that I need to load to make the "export" function available. I tried re-loading Bioconductor, but that didn't help. I assume that "export" must be part of some other library/package.
Also, when I looked at the "z" object both BEFORE and AFTER running the (seqlevelsStyle(z) <- "UCSC") command, it appears that the only difference is that it appended "CHR" to the front of the chromosome names in the "seqnames" column. Is this correct?
Yes, you also need rtracklayer, which provides the
export
function. Sorry for the oversight.Well,
seqlevelsStyle
does more than simply adding a chr to the front of the chromosome names, but for our purposes that's the only part that matters.Thanks for that James.
However, I have run in to another problem. When I went back in to R (actually I'm using RStudio on a Windows PC), and tried to re-run all the commands in your initial solution steps, I successfully completed the first command ( library(AnnotationHub) ). However, when I entered the second command ( hub <- AnnotationHub() ), I get the following error message:
>
> hub <- AnnotationHub()
Error in loadNamespace(name) : there is no package called ‘curl’
In addition: Warning message:
database may not be current
database: ‘C:/Users/gmul3410/Documents/AppData/.AnnotationHub/annotationhub.sqlite3’
reason: there is no package called ‘curl’
>
This command worked the first time I used it yesterday. The only thing that I can think of that has changed since then is that when I was trying to figure out what library I needed to load to get the "export" function working, I re-loaded Bioconductor, which may have installed a new version of Bioconductor, and some of the dependent packages associated with Bioconductor.
So thinking it may now be a compatibility issue between Bioconductor and "R" and maybe RStudio, I upgraded to the latest versions of "R" (Ver 3.4.0), and RStudio (Ver 1.0.143), and Bioconductor (Ver 3.5).
However, rather than fixing the problem, after I upgraded to the latest versions of everything, I now cannot even run the library(AnnotationHub) command. I now get the following error message:
>
> library(AnnotationHub)
Error in library(AnnotationHub) :
there is no package called ‘AnnotationHub’
>
I tried going back to the previous version of "R" (Ver 3.3.3), which was still installed on my PC. This allowed me to successfully run the library(AnnotationHub) command again, but I still get the same error when I try to run the hub <- AnnotationHub() command.
If you ever get an error of the form
That means you need to install.
Thanks James. I have been away for a week or so, but have been able to successfully follow through the processes in your answer, and everything worked well. We used the GTF file: Drosophila_melanogaster.BDGP5.78.gtf (loaded using the command z <- hub[["AH28596"]] )
However, we now have a need to use a different GTF file (Drosophila_melanogaster.BDGP5.25.62.gtf). But I was not able to locate this GTF by querying the AnnotationHub library. However, we have separately downloaded this file from the Ensembl website (ftp://ftp.ensembl.org/pub/release-62/gtf/drosophila_melanogaster) onto a Windows-based file share.
Is there a way to import this (Drosophila_melanogaster.BDGP5.25.62.gtf) file from a Windows file share in to "R" and apply the same "seqlevelsStyle" function to it, (rather than using a command like z <- hub[["AH28596"]] )
Yes.
Then at a terminal prompt
And after the above we get