Hi all!
I analyzed some RNA-seq samples using kallisto, to quantify the expression of protein-coding genes, and also Telescope, to quantify the expression of human endogenous retroviruses (HERVs). Ultimately, I would like to concatenate HERV and gene counts per sample, so that I can apply a TMM normalization, followed by an inverse normal transformation, to these values collectively, as suggested by GTEx for an eQTL analysis.

The problem is that, whereas there are programs like tximport to import kallisto files into a DESeq2 or edgeR object, the same is not true for the output of Telescope, so I am a bit unsure how I can uniformly concatenate these data before applying the TMM normalization and the inverse normal transformation. My main concern is that tximport produces gene-level estimated counts and an associated edgeR offset matrix, for example, which I don't know how to construct for the Telescope output (and different HERVs will have different lengths!). Do you have any advice on the best practice for merging information from these two quantification tools?
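For concreteness, this is roughly what I would naively do to get the two quantifications into one matrix (the paths and sample names below are placeholders, and I'm assuming I've already collected each sample's Telescope report into a single herv_counts matrix):

    library(tximport)

    # Placeholder sample names and kallisto output paths
    samples <- c("S1", "S2", "S3")
    files <- file.path("kallisto", samples, "abundance.h5")
    names(files) <- samples

    # tx2gene: data.frame mapping transcript IDs to gene IDs (built from my GTF)
    txi <- tximport(files, type = "kallisto", tx2gene = tx2gene,
                    countsFromAbundance = "no")   # raw gene-level estimated counts

    # herv_counts: integer matrix of Telescope counts (HERV loci x samples),
    # assembled beforehand from the per-sample Telescope reports, same column order
    stopifnot(identical(colnames(txi$counts), colnames(herv_counts)))
    combined <- rbind(round(txi$counts), herv_counts)

But I'm not sure this naive rbind is actually legitimate, which is what the rest of this post is about.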
tl;dr:
From what I understand, Telescope outputs raw read counts (integers), corresponding to how many transcripts per specific HERV copy were detected in the RNA-seq data. Therefore, to import these counts into a DESeq2 object, for example, I would import a table with the raw HERV counts using the DESeqDataSetFromMatrix function. For protein-coding genes, however, I have the impression that a tximport object corresponding to my gene expression quantification from kallisto, txi.tx$counts, will be normalized to average gene length (since I summarized transcripts to genes when importing the kallisto files, which creates an "offset matrix"?). Thus, I don't know whether I can simply concatenate txi.tx$counts with my table of raw HERV counts before applying the TMM normalization and the inverse normal transformation. As for the HERVs, I've got their putative lengths and I could correct their counts for length (count divided by length; is that what the offset matrix is doing?), but I don't understand whether this would be equivalent to the transcript-length normalization applied by tximport.
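For the HERV side on its own, I assume the import would just be something like this (coldata here is a placeholder sample table):

    library(DESeq2)

    # herv_counts: raw integer Telescope counts (HERV loci x samples)
    # coldata: one row per sample, rownames matching colnames(herv_counts)
    dds_herv <- DESeqDataSetFromMatrix(countData = herv_counts,
                                       colData   = coldata,
                                       design    = ~ 1)   # placeholder design

It's combining this with the kallisto/tximport side that I'm unsure about.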
Can anyone shed light on what you think would be the best practice here?
Hi Michael Love, thank you so much for your help once again! Can I pick your brain just a bit more? I've got two intertwined problems, I think:
1) Re: library size. GTEx suggests performing the eQTL analysis on transformed TMM values, which, according to the edgeR manual, already account for library size "by finding a set of scaling factors for the library sizes that minimizes the log-fold changes between the samples for most genes". So shouldn't I use countsFromAbundance="no" to obtain the raw counts for the genes, then concatenate these with the raw HERV counts, and then apply the edgeR TMM normalization (considering HERVs + genes, not genes only)? I could then apply a gene length normalization to both the genes and the HERVs, as the HERVs in my annotation also come in different sizes. (See the rough sketch after point 2.)

2) Re: gene length normalization. It doesn't look like GTEx did this for their eQTL analyses, but I feel like there could be benefits in doing so. If we are comparing expression levels of the same gene across AA/Aa/aa genotypes, the gene length bias will be diluted across the conditions, so the overall effect of an eQTL will be apparent regardless of the gene length bias (e.g. allele "A" is associated with reduced or increased expression of a gene). However, if we don't correct for gene length, the effect of "A" may not be directly comparable across genes (e.g. it would not be possible to say that "A" affects gene X more than gene Y in the same tissue). So there may be an advantage in correcting for gene length, but how do I do this without applying the TPM normalization automatically (since I want to use TMM, and this post says it's not a great idea to normalize already normalized data)?
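This is the rough sketch I have in mind for point 1 (object names as in my first message; untested):

    library(edgeR)

    # combined: genes + HERVs, raw counts, same samples in the same column order
    y <- DGEList(counts = combined)
    y <- calcNormFactors(y, method = "TMM")   # TMM scaling factors
    logcpm <- cpm(y, log = TRUE)              # TMM-normalized log2-CPM

    # Rank-based inverse normal transform per feature across samples,
    # which is my reading of the GTEx recipe
    int <- t(apply(logcpm, 1, function(x) qnorm((rank(x) - 0.5) / length(x))))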
I guess a dirty way of doing this would be to extract txi$counts after running tximport with countsFromAbundance="no", then pass this on to edgeR to construct the TMM values, and divide these values by the average gene/HERV length (extracted from the tx2gene object for the genes, and from the Telescope annotation for the HERVs). What do you think?

1) Right, that is similar to the median ratio approach we take in DESeq2. You can provide the counts-from-abundance counts to edgeR to then perform TMM for sequencing depth correction. I think this is the easiest way to go given all the integration you have to do here.
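Roughly (untested, reusing the object names from above, and assuming countsFromAbundance="lengthScaledTPM" is the counts-from-abundance flavor you go with):

    library(tximport)
    library(edgeR)

    # Counts-from-abundance for the genes, so gene length is handled on that side
    txi_cfa <- tximport(files, type = "kallisto", tx2gene = tx2gene,
                        countsFromAbundance = "lengthScaledTPM")

    # Combine with the HERV counts (still raw here), then TMM for sequencing depth
    y <- DGEList(counts = rbind(txi_cfa$counts, herv_counts))
    y <- calcNormFactors(y, method = "TMM")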
2) I don't really follow, I think: the issue with gene length bias for large samples only arises when it is correlated with the condition. Otherwise, I don't think there is a big concern, actually.
Re: your last comment, if you do want to correct for gene length, but have to do the integration yourself combining with some other data, I would use counts-from-abundance and pass this to edgeR to construct TMM values. In DESeq2 and in edgeR we perform library size correction after gene length normalization (we just usually do this making use of an offset and the original counts, rather than modifying the counts).
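For reference, the offset route for the gene rows looks roughly like the edgeR recipe in the tximport vignette (the HERV rows would need a comparable length matrix before they could join this):

    library(tximport)
    library(edgeR)

    cts <- txi$counts
    normMat <- txi$length
    # Divide out the geometric mean length per gene, then fold in effective library sizes
    normMat <- normMat / exp(rowMeans(log(normMat)))
    normCts <- cts / normMat
    eff.lib <- calcNormFactors(normCts) * colSums(normCts)
    normMat <- sweep(normMat, 2, eff.lib, "*")
    normMat <- log(normMat)

    y <- DGEList(cts)
    y <- scaleOffset(y, normMat)   # offset carries both length and library size

Keeping the correction in the offset means the counts themselves stay raw, which is what we expect in edgeR and DESeq2.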