Hello A/Prof Love,
Hope you are well and looking forward to CSAMA in the Italian Alps!
After reading your comprehensive machine learning slides from BIOS 735 - Introduction to Statistical Computing (plus the HarvardX YouTube videos with Prof Rafael Irizarry), we were hoping to see one of the ML examples use differentially expressed genes from an RNA sequencing analysis (please point us in the right direction if such an example exists and we have accidentally missed it - sorry).
Essentially, we would like to identify differentially expressed genes using DESeq2 and then build a binomial (elastic net) classifier from the top DE genes.
Project:
- Cohort: 100 disease, 100 control
- Specimen: Human plasma samples.
- Pipeline: Nextflow/rnaseq (Salmon/STAR transcriptome alignment)
What count matrix/file from tximport or DESeq2 would be advised for an ML classifier? The emphasis is on a parsimonious set of genes that is robust and reproducible, with controlled uncertainty. (Is this the trillion dollar question? We only have a billion, lol.)
Also, if we supply a design matrix (including technical confounders, to adjust the degrees of freedom) to DESeq2, should we use this corrected matrix as the values for the ML classifier (would this data be in the dds)? Or should we be using a matrix that has been transformed by vst or rlog?
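To make the question concrete, the rough workflow we have in mind is sketched below (just a sketch; coldata, the design formula, and the number of top genes are placeholders):

library(tximport)
library(DESeq2)
library(glmnet)

# import Salmon quantifications and build the dataset (coldata/design are placeholders)
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
dds <- DESeqDataSetFromTximport(txi, colData = coldata, design = ~ batch + condition)
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition", "disease", "control"))

# take the top DE genes by adjusted p-value (the cutoff of 50 is arbitrary here)
top <- head(rownames(res[order(res$padj), ]), 50)

# variance-stabilised values as features, samples in rows
vsd <- vst(dds, blind = FALSE)
x <- t(assay(vsd)[top, ])
y <- colData(dds)$condition

# elastic net (binomial) with cross-validation
fit <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)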
Happy to post this to Bioconductor if you prefer; it started off as more of an ML query related to your BIOS 735 course.
Thanks in advance for any insight,
Chris
Thank you Mike.
Re: "The VST or scaled counts should be fine."
So either the DESeqTransform object following VST, or the txi object (txi <- tximport(files, type="salmon", tx2gene=tx2gene)) from before DESeqDataSetFromTximport, can be used?
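Just to make sure we are pointing at the same objects, the two candidates (as we understand them) would be:

# candidate 1: variance-stabilised values from the DESeqTransform object
vsd <- vst(dds, blind = FALSE)
mat_vst <- assay(vsd)

# candidate 2: counts straight from tximport, before DESeqDataSetFromTximport
mat_txi <- txi$counts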
Re: the last point you highlighted: if we do not use limma's removeBatchEffect (with the batch and design arguments), might the flow hypothetically look like this?
dds <- DESeqDataSetFromTximport(txi, colData = coldata, design = ~ batch + condition)  # colData/design are placeholders
vsd <- vst(dds, blind = FALSE)
Then use the transformed values in assay(vsd) as the features for the ML classifier.
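And, for completeness, our understanding of the alternative from your vignette (a sketch only, assuming a 'batch' column in colData) would be:

# alternative sketch: explicitly remove the batch term from the VST values
vsd <- vst(dds, blind = FALSE)
mm <- model.matrix(~ condition, colData(vsd))
mat <- limma::removeBatchEffect(assay(vsd), batch = vsd$batch, design = mm)
# 'mat' would then be the gene x sample feature matrix for the classifier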
Also, we have a wide range of sequencing depth in our human plasma samples after UMI deduplication (2 million - 10 million reads). Would you recommend rlog as more appropriate, or should we stick with the vst given the cohort size? We don't mind the slowness (slow is smooth and smooth is fast), but we also note the comments in your guide under which transformation to choose.
Thanks for any comments you might have on that aspect.
Hope CSAMA went well.
Scaled counts would be:
counts(dds, normalized=TRUE)
Yes.
Stick with the VST; we have preferred it since the 2014 publication.
Thanks again Mike. Your comments and contribution are, as always, invaluable to the community.
Lastly, in relation to using the vst: how should the VST unit be interpreted when exponentiating the elastic net coefficients (i.e. as odds ratios)?
The VST is asymptotically log2 of the counts, so you can think of the coefficients in a linear model as LFCs.
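For example (a hypothetical sketch, assuming an elastic net fit 'fit' from cv.glmnet with family = "binomial" on the VST matrix):

# coefficients are on the VST (approximately log2) scale
beta <- coef(fit, s = "lambda.min")
nz <- beta[, 1][beta[, 1] != 0]
odds_ratios <- exp(nz)
# each odds ratio is per 1-unit increase in the VST value,
# i.e. roughly per doubling of the normalized counts for that gene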