The feature protein_coding doesn't colour the first and only protein-coding transcript. One of the transcripts is a processed transcript but has no protein, while the shortest is non-coding. It appears that the function wrongly identified the coding sequences. Any hint as to why?
This is not really an issue with Gviz, but rather with the Ensembl Biomart. They seem to store the type at the level of the gene, not the transcript. So if a gene has even a single protein-coding transcript, the whole gene is annotated as protein_coding, and every one of its transcripts carries that label. That is exactly what you are seeing in the plot. You can try that out yourself by downloading the data through the Ensembl Biomart: (http://www.ensembl.org/biomart/martview)
The attribute that stores the type is called "gene_biotype". One could consider a more elaborate algorithm that figures out whether a transcript is coding or non-coding by looking at the CDS start and end locations, if they are available. However, that will take a bit of restructuring of the code.
Looking at the available annotation features in Biomart it looks like we could use the CDS length field as an indicator whether a transcript is indeed protein coding. That should be empty for these cases. Will look at this once I find a couple of free minutes and provide a patch.
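To make the idea concrete, here is a minimal sketch of the check in plain Python. It is not the Gviz patch itself. The rows are made-up records shaped roughly like a Biomart export, and the `cds_length` key, the transcript names, and the `non_coding` fallback label are all assumptions made for illustration.

```python
# Hypothetical records: one per transcript, all sharing the gene-level biotype.
# Field names and values are illustrative, not taken from a real Biomart query.
rows = [
    {"transcript": "tx_long", "gene_biotype": "protein_coding", "cds_length": "1245"},
    {"transcript": "tx_processed", "gene_biotype": "protein_coding", "cds_length": ""},
    {"transcript": "tx_short", "gene_biotype": "protein_coding", "cds_length": None},
]


def transcript_feature(row):
    """Keep the gene biotype only when the transcript itself has a CDS."""
    raw = row.get("cds_length")
    # An empty or missing CDS length is treated as "no coding sequence".
    has_cds = raw not in (None, "") and int(raw) > 0
    if row["gene_biotype"] == "protein_coding" and not has_cds:
        return "non_coding"  # hypothetical label for transcripts without a CDS
    return row["gene_biotype"]


features = {row["transcript"]: transcript_feature(row) for row in rows}
print(features)
```

With these made-up rows only `tx_long` keeps the protein_coding label. The other two fall back to the placeholder label, even though the gene as a whole is annotated as protein_coding.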
So in my case, which argument should I use after the update to pull the info about the transcripts? Will there be a lot to change in the current code? I assume I can delete protein_coding for sure.
You won't have to change anything. Just wait for version 1.16.4 to appear on the package server and update. It should happen sometime later today. The last SVN snapshot on the build system was taken two days ago, so a new version must be building right now.
Looks like a corrupted file was checked into SVN. The package builds locally but failed in the Bioc build system. Just committed a fix (I hope). Will keep an eye on this and let you know.
For your particular example I can see that only the longest transcript has coding regions. The two smaller ones are fully non-coding, and the long one also has 3' and 5' UTRs:
Florian
Ok, that seems to do the trick. Should become available with the next package update in a couple of days.
Hi,
so in my case, which argument should I use after the update to pull the info about the transcripts? Will there be a lot to change in the current code? I assume I can delete protein_coding for sure.
Thanks for the fast response,
Simon