Hi, I have deep isoform-level RNA-seq data from StringTie on about 800 people. I created gene-level and isoform-level data using tximport and have just looked at the voom mean-variance trend plots. At the gene level, the plot looks as usual (the trend decreases at low expression levels and then levels off). For the isoform data, the plot looks different: the fitted curve continues to decrease all along the x-axis without ever leveling off (so no variance stabilization). I have tried both loose and quite stringent filtering (the stringent filter requires at least 15 reads in 75% of individuals), and in both cases the plot looks the same. We do not need variance stabilization since we use the voom weights in subsequent analyses, but is this type of plot of any concern? Has this been seen for other isoform data? As expected, the isoform data points cluster strongly on the left side of the graph, while the gene-level data cluster more in the middle. Thank you for any comments.
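For readers who want to see the stringent filter spelled out, here is a minimal sketch in plain Python of the rule described above (at least 15 reads in 75% of individuals). The function names `passes_filter` and `filter_features`, the feature IDs, and the toy counts are all hypothetical and only for illustration; this is not the code used in the analysis above, which would normally be done in R.

```python
def passes_filter(counts, min_reads=15, min_fraction=0.75):
    """Return True if at least min_fraction of samples have >= min_reads reads."""
    if not counts:
        return False
    n_ok = sum(1 for c in counts if c >= min_reads)
    return n_ok >= min_fraction * len(counts)


def filter_features(count_table, min_reads=15, min_fraction=0.75):
    """Keep only the features (isoforms or genes) that pass the filter.

    count_table maps a feature ID to its list of per-sample read counts.
    """
    return {
        feature: counts
        for feature, counts in count_table.items()
        if passes_filter(counts, min_reads, min_fraction)
    }


# Toy data: 4 samples per isoform (hypothetical IDs and counts).
toy = {
    "isoform_A": [20, 30, 15, 2],  # 3 of 4 samples reach 15 reads: kept
    "isoform_B": [20, 3, 15, 2],   # 2 of 4 samples reach 15 reads: dropped
}
kept = filter_features(toy)
print(sorted(kept))
```

Loosening the filter only means lowering `min_reads` or `min_fraction`; the rule itself stays the same.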
Michael and Gordon, I very much appreciate your comments. I fully realize that voom (and other methods) were designed for gene-level analysis. StringTie was used because one of the goals is to discover novel isoforms. I am planning to use Salmon at least on a subset of the samples (n=855), but I have not yet figured out what to do with the results from Salmon. First, Salmon can be run with selective alignment (SA) or using the BAM files from STAR alignment (which I have). Which of these two options do you recommend? I assume that bootstrapping can be performed with both. Second, once I have the bootstrap results from Salmon/tximport, what should I do with them? You seem to have some ideas on how to incorporate the variance inflation into voom. While you mostly raise the issue of power, the 2017 Nature Methods sleuth paper seems to focus mostly on false positives, but I may not quite understand this paper yet. They estimate an inferential variance from bootstrapping and subtract it from the total variance to obtain a biological variance, which is then regularized. Given my large sample size, regularization should not matter, and in that case I do not see how their method would make any difference. So in summary, yes, I plan to use Salmon, but it is not trivial to figure out what to do with the results from Salmon to improve differential expression analysis. Would be grateful for any suggestions …
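To make the decomposition described above concrete, here is a minimal sketch in plain Python for a single transcript in a single group of samples. It follows the description in this post (total variance minus bootstrap-based inferential variance, floored at zero) and deliberately omits the regularization step. The function name `decompose_variance` and the toy numbers are hypothetical; this is an illustration under those assumptions, not the sleuth implementation.

```python
from statistics import mean, variance


def decompose_variance(bootstraps_per_sample):
    """Split the observed variance of one transcript into two parts.

    bootstraps_per_sample: one list per sample, each holding that sample's
    bootstrap estimates of log abundance for the transcript.
    Returns (total, inferential, biological).
    """
    # Point estimate per sample: mean of its bootstrap replicates.
    point_estimates = [mean(b) for b in bootstraps_per_sample]
    # Total variance: sample variance of the point estimates across samples.
    total = variance(point_estimates)
    # Inferential variance: within-sample bootstrap variance, averaged over samples.
    inferential = mean(variance(b) for b in bootstraps_per_sample)
    # Biological variance: whatever remains, floored at zero.
    biological = max(total - inferential, 0.0)
    return total, inferential, biological


# Toy data: 3 samples with 2 bootstrap replicates each (hypothetical values).
toy_bootstraps = [[1.0, 3.0], [5.0, 7.0], [9.0, 11.0]]
total, inferential, biological = decompose_variance(toy_bootstraps)
print(total, inferential, biological)
```

In this sketch the inferential variance is tracked per transcript as its own term, separately from whatever is later done to the biological term.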
Regarding SA and STAR, this is addressed in depth here:
https://www.biorxiv.org/content/10.1101/657874v2
We have published a paper and have a Bioconductor package with a method, Swish, for using inferential uncertainty in the analysis:
https://academic.oup.com/nar/article/47/18/e105/5542870