MLSeq Mathematical Concepts
Dario Strbenac ★ 1.5k
@dario-strbenac-5916
Last seen 2 days ago
Australia
Hello,

From reading the vignette, MLSeq seems to be a set of wrapper functions that allows the user easy access to normalisation strategies in edgeR or DESeq and passes the data on to algorithms such as Support Vector Machine or Random Forest. Are there any results that demonstrate that normalisation improves classification performance?

I am also not convinced by the description of using voom weights to transform the data. The author of voom stated that specialised clustering and classification algorithms are needed to handle the CPM values and weights separately. Why does MLSeq use standard classification algorithms, and how were the weights and expression values combined?

--------------------------------------
Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia
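For context, the vignette-style workflow being asked about looks roughly like the following. This is a sketch only: `classify()` is MLSeq's entry point, but the exact argument names and accepted values are recalled approximately from the vignette and may differ between package versions, and `counts` and `class` are hypothetical inputs.

```r
# Sketch of the MLSeq workflow described above (argument names are
# approximate; consult the MLSeq vignette for the current interface).
library(MLSeq)
library(DESeq2)

# 'counts' is a hypothetical genes-x-samples matrix of raw RNA-Seq
# counts; 'class' is a factor of sample labels.
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = DataFrame(condition = class),
                              design = ~ condition)

# Normalisation (DESeq- or edgeR-style) and the classifier (e.g. SVM,
# random forest) are chosen through classify()'s arguments; the
# "voomCPM"-style transform is what the question above is about.
fit <- classify(data = dds, method = "svm",
                normalize = "deseq", deseqTransform = "vst",
                cv = 5, rpt = 3)
fit
```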
Classification edgeR DESeq MLSeq • 1.7k views
@wolfgang-huber-3550
Last seen 3 months ago
EMBL European Molecular Biology Laborat…
Dear Dario,

Good points, and as usual in machine learning, I don't expect there to be a simple answer or universally best solution. For classification, the (pre)selection of features (genes) used is probably more important than most other choices, especially if the classification task is simple and can be driven by a few genes. For clustering, similar, plus the choice of distance metric or embedding.

That said, it is plausible that both the untransformed counts (or RPKMs etc.) and the log-transformed values have problems with high variance (at the upper or lower end of the dynamic range, respectively) that can be avoided with a different transformation: log-like for high values, linear-like for low (e.g. DESeq2's vst, rlog). Paul McMurdie and Susan Holmes have some results on this in their waste-not-want-not paper [1], and Mike in a supplement to the DESeq2 paper (draft) [2]. It would be interesting to collect more examples, and someone should probably study this more systematically (if they aren't already).

Kind regards
Wolfgang

[1] http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003531
[2] http://www-huber.embl.de/DESeq2paper -> "Regularized logarithm for sample clustering" (as of today, there is a version of 19 February which I think will soon be updated with a more extensive survey).
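A minimal sketch of the transformation-before-clustering idea mentioned above, using DESeq2's variance-stabilising transformations. The `counts` matrix and `coldata` data frame are hypothetical inputs, not from the thread.

```r
# Sketch: variance-stabilising transformations before sample clustering.
# 'counts' (genes x samples) and 'coldata' are hypothetical inputs.
library(DESeq2)

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = coldata,
                              design = ~ 1)

# Both transformations are log-like for high counts and linear-like for
# low counts, taming the variance at the extremes of the dynamic range.
vsd <- varianceStabilizingTransformation(dds)
rld <- rlog(dds)

# Euclidean distances between samples on the transformed scale, then
# hierarchical clustering.
plot(hclust(dist(t(assay(vsd)))))
```

Whether this actually improves downstream classification is exactly the open question raised in the thread; the sketch only shows where the transformation slots in.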
Dear Dario,

I think you are right to be careful about simply using the voom weights to pre-transform the data. As Dr. Smyth pointed out a while ago, an algorithm should always use these weights explicitly in some way rather than using them to pre-transform the data. You could possibly incorporate them easily in a DDA classifier, for example.

Apart from Wolfgang's links, I might point you to two interesting papers:

[1] Zwiener et al. - Transforming RNA-Seq Data to Improve the Performance of Prognostic Gene Signatures
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0085150
They investigate a couple of pre-transformations for RNA-Seq data classification and find that rank-based transformations perform well in general. (They do not consider voom weights.)

[2] Gallopin et al. - A Hierarchical Poisson Log-Normal Model for Network Inference from RNA Sequencing Data
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0077503
They use a GLMM combined with a lasso penalty to incorporate unequal sample variances and then estimate a graphical model using a type of partial correlation. This is somewhat similar to the voom approach; however, the variances and the model parameters are estimated in one go. They note, though, that the algorithm used is very slow.

Best wishes,
Bernd
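One way the "use the weights explicitly" suggestion might look in practice is a hand-rolled weighted nearest-centroid (DDA-style) rule on top of voom's output. This is an illustrative sketch, not MLSeq's implementation: `counts` and `class` are hypothetical inputs, and `centroid`/`predict_sample` are made-up helper names.

```r
# Sketch: feeding voom's precision weights into a simple
# diagonal-discriminant-style classifier, instead of pre-transforming
# the expression values. 'counts' (genes x samples) and 'class' (factor
# of sample labels) are hypothetical.
library(limma)

design <- model.matrix(~ class)
v <- voom(counts, design)  # v$E: log-CPM values; v$weights: precision weights

# Weighted per-class centroid for each gene: sum(w * x) / sum(w).
centroid <- function(E, W, idx) {
  rowSums(E[, idx, drop = FALSE] * W[, idx, drop = FALSE]) /
    rowSums(W[, idx, drop = FALSE])
}
mu <- sapply(levels(class), function(k) centroid(v$E, v$weights, class == k))

# Classify a new sample (log-CPM vector x with weights w) by weighted
# squared distance to each class centroid; the weights enter the metric
# rather than the data.
predict_sample <- function(x, w, mu) {
  colnames(mu)[which.min(colSums(w * (x - mu)^2))]
}
```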
I think it would be great to capture these ideas and a brief implementation as a 'Sequence Analysis Machine Learning' workflow (http://bioconductor.org/help/workflows/) that re-uses some pre-existing but real data, like http://bioconductor.org/packages/release/data/experiment/html/RNAseqData.HNRNPC.bam.chr14.html or other new / existing Experiment Data packages (http://bioconductor.org/packages/release/BiocViews.html#___ExperimentData). The basic requirement for a workflow is an R markdown (Rmd or Rnw) document; Sonali (cc'd) has agreed to help with the technical aspects needed to 'make this so'. Any takers (on or off-list)?

Martin
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793