scholarly reference for "don't draw PCA/heatmap dendrograms on DEGs"

Aaron Mackey ▴ 200

@aaron-mackey-3833

Last seen 10.3 years ago

A colleague of mine is skeptical of my assertion that drawing sample-level PCA plots and/or clustered heatmaps based only on differentially expressed genes (DEGs) is a circular, self-fulfilling prophecy. They assert that there's no guarantee samples will cluster by condition (despite the fact that the condition is exactly what drives selection of DEGs), and so hope to use the observed clustering as further "evidence" of the condition effects. Rather than spend more time trying to explain statistical concepts, I was hoping to checkmate the argument with a nice Nature Methods review or somesuch. Any pointers?

Thanks in advance,
-Aaron

Clustering Clustering • 2.9k views

ADD COMMENT • link updated 11.4 years ago by Malcolm Cook ★ 1.6k • written 11.4 years ago by Aaron Mackey ▴ 200

Lorena Pantano ▴ 140

@lorena-pantano-6001

Last seen 11 months ago

Boston

Hi,

I don't have a reference to give you, but in my experience you don't necessarily get a heatmap that cleanly separates the two conditions even when you use only DE genes. This may be because the DE results are often not strong enough to separate the two groups, because a systematic outlier in the comparison produces DE genes that are not real, or for some other reason.

I have done more than 50 DE analyses, and only once did I get a clear heatmap showing two groups. So I guess there is something there.

Your initiative is very interesting.

Cheers,
Lo

ADD COMMENT • link 11.4 years ago Lorena Pantano ▴ 140

I don't have a good reference either.

But you can easily simulate matrices full of IID standard normal data, pick the "most differentially expressed" genes, and show that this noise/nonsense perfectly separates any two "groups" that you want to pretend are present in the data.

-- Kevin

ADD REPLY • link 11.4 years ago Kevin Coombes ▴ 430
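
Editorial sketch of the simulation Kevin describes, applied to the clustered-heatmap case. Everything here is an assumption made for illustration: the matrix size, the seed, the cutoff of 100 genes, the pooled-variance t-statistic written out by hand in the hypothetical helper `row_t`, and complete-linkage clustering on Euclidean distances. It uses base R only and is not code from the thread.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

n_genes <- 10000
group <- factor(rep(c("a", "b"), each = 5))   # two pretend conditions

# Pure noise: no gene has a true difference between the groups
expr <- matrix(rnorm(n_genes * length(group)), nrow = n_genes)

# Pooled-variance two-sample t-statistic for every row (hypothetical helper)
row_t <- function(x, g) {
  a <- x[, g == levels(g)[1], drop = FALSE]
  b <- x[, g == levels(g)[2], drop = FALSE]
  na <- ncol(a)
  nb <- ncol(b)
  ss <- rowSums((a - rowMeans(a))^2) + rowSums((b - rowMeans(b))^2)
  pooled_var <- ss / (na + nb - 2)
  (rowMeans(a) - rowMeans(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
}

# Keep the 100 "most differentially expressed" genes
t_stat <- row_t(expr, group)
top <- order(abs(t_stat), decreasing = TRUE)[1:100]

# Cut a sample dendrogram into two clusters
cluster_of <- function(m) cutree(hclust(dist(t(m))), k = 2)

table(group, cluster = cluster_of(expr[top, ]))  # selected genes only
table(group, cluster = cluster_of(expr))         # all genes
```

With the selected genes, the two-cluster cut should reproduce the pretend groups even though the data contain no signal; with all genes there is no reason for it to do so.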

These papers don't show clustered heatmaps, but they do show the inflation of classification accuracy and survival discrimination in simulated no-signal data when only differentially expressed genes are used. So if you consider your clustering as the classifier, they may be relevant:

Simon RM, Subramanian J, Li M-C, Menezes S. Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data. Brief Bioinform. 2011 May 15;12(3):203-14.

Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst. 2003 Jan 1;95(1):14-8.

ADD REPLY • link 11.4 years ago Levi Waldron ★ 1.1k
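
Editorial sketch of the kind of inflation Levi refers to, on no-signal data. It compares leave-one-out cross-validation with gene selection done once on all samples (so the held-out sample has already influenced the gene list) against selection repeated inside each fold. The nearest-centroid classifier, the sample and gene counts, and the helper names `top_genes`, `predict_centroid` and `loocv` are assumptions chosen for brevity; they are not the procedures used in the cited papers.

```r
set.seed(2)  # arbitrary seed

n_genes <- 5000
n <- 20
n_top <- 20
y <- rep(c(0, 1), each = n / 2)                   # two pretend classes
x <- matrix(rnorm(n_genes * n), nrow = n_genes)   # pure noise

# Indices of the n_top genes with the largest absolute pooled t-statistic
top_genes <- function(x, y, n_top) {
  a <- x[, y == 0, drop = FALSE]
  b <- x[, y == 1, drop = FALSE]
  ss <- rowSums((a - rowMeans(a))^2) + rowSums((b - rowMeans(b))^2)
  se <- sqrt(ss / (ncol(x) - 2) * (1 / ncol(a) + 1 / ncol(b)))
  order(abs(rowMeans(a) - rowMeans(b)) / se, decreasing = TRUE)[seq_len(n_top)]
}

# Nearest-centroid prediction for a single held-out sample
predict_centroid <- function(train, y_train, test) {
  c0 <- rowMeans(train[, y_train == 0, drop = FALSE])
  c1 <- rowMeans(train[, y_train == 1, drop = FALSE])
  as.integer(sum((test - c1)^2) < sum((test - c0)^2))
}

# Leave-one-out accuracy, selecting genes inside or outside the loop
loocv <- function(x, y, select_inside) {
  fixed <- top_genes(x, y, n_top)   # chosen using ALL samples
  pred <- sapply(seq_along(y), function(i) {
    genes <- if (select_inside) top_genes(x[, -i], y[-i], n_top) else fixed
    predict_centroid(x[genes, -i, drop = FALSE], y[-i], x[genes, i])
  })
  mean(pred == y)
}

acc_biased <- loocv(x, y, select_inside = FALSE)
acc_honest <- loocv(x, y, select_inside = TRUE)
c(biased = acc_biased, honest = acc_honest)
```

The biased estimate should come out far above the honest one, which is expected to sit near chance. A sample clustering drawn on pre-selected DEGs is in the same position as the biased estimate: the labels were already used to choose the features.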

Malcolm Cook ★ 1.6k

@malcolm-cook-6293

Last seen 11 weeks ago

United States

Aaron,

I don't have a reference either, but I agree with your assertion.

Nonetheless, I wonder, on this topic: have you done either on ALL (not just DE) genes? If so, do your replicates cluster? Further, if so, do the distances between replicate clusters scale in any interesting way with condition (i.e. higher dose, better knockdown, or longer exposure -> further away from untreated)? I think this can be taken as "evidence" for the condition effects that you and your colleague should expect. Do you agree?

I'm curious about this especially because I have submitted such plots as supplemental figures in the past.

Cheers,
Malcolm

ADD COMMENT • link 11.4 years ago Malcolm Cook ★ 1.6k
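
Editorial sketch of the check Malcolm proposes: run PCA on all genes, with no label-based selection, and ask whether replicate groups move further from untreated as the dose rises. The data are simulated under assumptions made up for this example (four dose levels, three replicates each, 200 of 2000 genes responding linearly to dose), so it only illustrates the mechanics of the check, not any real experiment.

```r
set.seed(4)  # arbitrary seed

n_genes <- 2000
n_responsive <- 200
dose <- rep(0:3, each = 3)   # 4 hypothetical dose levels, 3 replicates each

expr <- matrix(rnorm(n_genes * length(dose)), nrow = n_genes)
# Assumed linear response: the first n_responsive genes rise 1 unit per dose level
expr[1:n_responsive, ] <- expr[1:n_responsive, ] +
  matrix(dose, n_responsive, length(dose), byrow = TRUE)

# PCA on ALL genes; the dose labels play no part in choosing features
pc1 <- prcomp(t(expr))$x[, 1]
pc1 <- pc1 * sign(cor(pc1, dose))   # PC sign is arbitrary; orient it along dose

centroid <- tapply(pc1, dose, mean)               # mean PC1 position per dose
shift_from_untreated <- centroid - centroid[["0"]]
round(shift_from_untreated, 1)
```

Because no gene was picked using the labels, an ordered shift like this is not built in by construction, unlike a plot restricted to DEGs.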

On Mon, Dec 9, 2013 at 10:38 AM, Cook, Malcolm <mec@stowers.org> wrote:

> Have you done either on ALL (not just DE) genes? If so, do your
> replicates cluster? Further, if so, do the distances between replicate
> clusters scale in any interesting way with condition (i.e. higher dose or
> better knockdown or longer exposure -> further away from untreated). I
> think this can be taken as "evidence" for condition effects that you and
> your colleague should expect. Do you agree with this?

In my experience, I do occasionally see "global" (all genes) clustering in (*scaled* and centered) PCA that corresponds to experimental conditions; in such cases I will also find a vast multitude of DEGs (which also raises the spectre of whether the usual between-sample normalization assumptions are being violated, and whether there may be unequal variances between groups). Or, to consider the situation a different way: when a small number of DEGs exhibit a very large magnitude of variance, an *unscaled* global PCA may also show experimental clustering (again, driven just by the variance of those DEGs). FYI, there are methods (such as those implemented in the superpc package) that use the PCA loadings of PCs correlated with the experimental design to select DEGs. It's all quite circular.

Either way, the presence or absence of sample clustering in PCA does not provide any independent evidence of treatment effects beyond what is already captured by the DEGs themselves, and so I usually argue that such "DEG-focused" PCA representations are not particularly informative (or at least no more informative than some representation of the DEGs themselves). We use the global PCA for QC discovery/confirmation of sample outliers, non-experimental batch effects, etc., but not for evaluation of the experimental axes of interest.

-Aaron

ADD REPLY • link 11.4 years ago Aaron Mackey ▴ 200
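
Editorial sketch of the scaled-versus-unscaled point above, under deliberately exaggerated assumptions: 10 of 2000 simulated genes shift by +/-10 between two conditions and every other gene is unit-variance noise. It reports how strongly PC1 tracks the condition and how much of PC1's squared loading falls on those 10 genes. The sizes, the seed and the helper name `summarise_pca` are made up for illustration; the superpc package is not used.

```r
set.seed(3)  # arbitrary seed

n_genes <- 2000
n_de <- 10
group <- rep(c(-1, 1), each = 5)   # two conditions, 5 samples each

expr <- matrix(rnorm(n_genes * length(group)), nrow = n_genes)
# Exaggerated effect: the first n_de genes shift by +/-10 between conditions
expr[1:n_de, ] <- expr[1:n_de, ] +
  10 * matrix(group, n_de, length(group), byrow = TRUE)

summarise_pca <- function(scale_genes) {
  pca <- prcomp(t(expr), center = TRUE, scale. = scale_genes)
  c(pc1_group_cor = abs(cor(pca$x[, 1], group)),
    # prcomp loadings have unit norm, so this is the share of PC1
    # carried by the n_de shifted genes
    pc1_de_share = sum(pca$rotation[1:n_de, 1]^2))
}

unscaled <- summarise_pca(FALSE)
scaled <- summarise_pca(TRUE)
rbind(unscaled, scaled)
```

In the unscaled run, PC1 should be dominated by the few high-variance genes and follow the condition closely. Scaling each gene to unit variance removes that advantage, so the same handful of genes contributes much less to PC1.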

Hey Aaron, you can show this fairly easily with a couple of lines of code (using randomly generated data). I think Kevin suggested something like this too:

    library(genefilter)
    # generate a 10,000 x 10 matrix (genes x samples) of random "gene expression" data
    mat <- matrix(rnorm(100000), nrow = 10000, ncol = 10)
    myfac <- factor(c(rep("a", 5), rep("b", 5)))
    tOut <- rowttests(mat, myfac)
    # indices of the top 1000 "differentially expressed" genes, ranked by p-value
    sigInd <- order(tOut$p.value)[1:1000]
    # PCA on the samples, using only those genes
    pcOut <- prcomp(t(mat[sigInd, ]))$x
    plot(pcOut, col = myfac)

Paul

--
Dr. Paul Geeleher, PhD
Section of Hematology-Oncology
Department of Medicine
The University of Chicago
900 E. 57th St., KCBD, Room 7144
Chicago, IL 60637
--
www.bioinformaticstutorials.com

ADD REPLY • link 11.4 years ago Paul Geeleher ★ 1.3k

I second the comment made by Aaron. I use PCA only for QC discovery and exploration prior to any testing for DEGs. I also agree with the original comment about it being a self-fulfilling prophecy (not that it always turns out that clearly). Anyone thinking that PCA plots based on DE features (obtained from the same data set) are going to confirm their findings is optimistically biased at best.

Wade

ADD REPLY • link 11.4 years ago Davis, Wade ▴ 350
