We're analyzing RNAseq data with a pipeline consisting of Salmon, tximeta, and DESeq2.
We have a multi-factorial experimental design, and the experiment was performed on cell lines.
One thing that surprised us is that in the results output we observe many gene polymorphisms.
For example, for the gene NLRP2 we observed multiple entries associated with different Ensembl IDs: ENSG00000022556, ENSG00000275082, ENSG00000275843, etc.
My question is: how do we interpret data like this, and how should we deal with this kind of situation? Can we add or average the different entries associated with the same gene?
Thanks so much for the clarification, Michael.
I was indeed confused by the alternative scaffolds included in the Ensembl genome.
Now that you've mentioned it, I will rebuild the Salmon index with the GENCODE reference transcriptome.
Oh, and a further recommendation: when you use Salmon to build the index, specify --gencode, which will clean up the transcript names in the Salmon output.
Thank you!!
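As a rough illustration of what the --gencode flag recommended above does (a sketch that mimics the behavior, not Salmon's actual implementation): GENCODE transcript FASTA headers pack several fields separated by '|', and with --gencode Salmon splits the transcript name at the first '|' and keeps only that first part. The header below is an example in GENCODE style, and the file and index names in the comment are placeholders.

```shell
# Indexing call with the flag (placeholder file and index names):
#   salmon index -t gencode.transcripts.fa.gz -i salmon_index --gencode

# Example GENCODE-style FASTA header: fields are separated by '|'.
header='>ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|'

# Keep only the text before the first '|' and drop the leading '>'.
# This mimics the shortened transcript name reported when --gencode is used.
clean_name=$(printf '%s\n' "$header" | cut -d'|' -f1 | sed 's/^>//')

echo "$clean_name"
```

Without the flag, the whole '|'-delimited string is kept as the transcript name, which makes it harder to match transcripts to genes downstream.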
Indeed I included the --gencode flag by following a tutorial from here: https://biocorecrg.github.io/RNAseq_course_2019/salmon.html :)
Right now I'm trying to extract some extra information (e.g. gene symbol, description) from the rowRanges slot. When I was using the Ensembl genome reference, these were automatically appended to the SummarizedExperiment object from AnnotationHub, but with the GENCODE genome this information was missing.
I've tried the makeLinkedTxome() function to link a local GENCODE GTF file, but it didn't seem to work.
Now I'm reading this vignette https://biodatascience.github.io/compbio/bioc/SE.html to see if I can add these back directly from the GENCODE GTF file. Any suggestions?
Have you tried addIds from the tximeta package?
Just tried addIds and it worked, thanks a lot Michael!