Hi all,
I aligned reads with STAR against the Ensembl genome using --quantMode GeneCounts. I reorganised the ReadsPerGene.out.tab files and extracted the unstranded counts to create a count matrix. I used this count matrix for DEG analysis with DESeq2, but I also wanted to generate TPMs as input for ssGSEA analyses. To generate the TPMs, I used the following formula:
t( t(counts.mat / gene.length) * 1e6 / colSums(counts.mat / gene.length) )
I estimated gene length with the ensembldb::lengthOf function, where:
"the length is the sum of the lengths of all exons of a transcript or a gene. In the latter case the exons are first reduced so that the length corresponds to the part of the genomic sequence covered by the exons."
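For anyone who wants to check the arithmetic, here is a minimal sketch of the formula above on toy data. The counts and gene lengths are made up for illustration (hypothetical values, not from my samples), and it assumes gene.length is in the same order as the rows of counts.mat. It also compares the one-line formula against an equivalent sweep() version:

```r
# Toy data (hypothetical values): 3 genes x 2 samples of raw counts
counts.mat <- matrix(c(100, 200, 300,
                       50,  50, 400),
                     nrow = 3,
                     dimnames = list(c("geneA", "geneB", "geneC"),
                                     c("s1", "s2")))

# Hypothetical union-exon gene lengths in bp, one per row of counts.mat
gene.length <- c(geneA = 1000, geneB = 2000, geneC = 3000)

# The lengths must be in the same order as the count matrix rows
stopifnot(identical(names(gene.length), rownames(counts.mat)))

# Reads per base: each row is divided by the length of its gene
rate <- counts.mat / gene.length

# Rescale each sample (column) so that its values sum to one million
tpm <- sweep(rate, 2, colSums(rate), "/") * 1e6

# The one-line formula from the post, for comparison
tpm.orig <- t(t(counts.mat / gene.length) * 1e6 / colSums(counts.mat / gene.length))

print(round(tpm, 1))
print(colSums(tpm))
```

Both versions should agree, and every column should sum to 1e6. Whether lengths are in bp or kb does not matter here, because the unit cancels in the per-sample normalisation.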
The ssGSEA results on these TPMs were quite consistent with observations in the literature, but my question is whether the approach I took can be considered valid.
Thanks.
Are all exons expressed equally in your samples? I don't get how people generate TPM without knowing the proportion of transcripts for their samples.
I highly doubt that. We ran the pipeline given to us by our bioinformatician, and it is clearly wrong. Glad I doubted it.