PreFiltering probe in microarray analysis
@stephanie-pierson-4671
Last seen 10.2 years ago
Hello everybody,

I am a French student in bioinformatics. I have to analyze microarray data and I have some questions about prefiltering genes. The dataset consists of 8 microarrays: 4 time points with 2 replicates per time point, run on Agilent two-color arrays (Whole Mouse Genome 4x44K Oligo Microarrays). We are looking for genes that are differentially expressed between two conditions (say C1 and C2) at each time point, and for genes that are differentially expressed within one condition (C1 or C2) over time.

I chose limma for the statistical analysis because I read (Jeanmougin et al., PLoS ONE; Jeffery et al., BMC Bioinformatics 2006, 7:359) that it works better in experiments with few replicates per condition. I ran the analysis on the whole data set (more than 37,000 genes), but after multiple-testing correction (Benjamini-Hochberg) the adjusted p-values are all high. I would like to prefilter genes before the statistical analysis, but I don't know how to do this. I read in Bourgon's paper that one can filter on the overall variance or on the overall mean, but with so few replicates, how should I proceed? Moreover, that paper does not recommend combining a filtering procedure with limma... Can someone help me, please?

Thank you, best wishes
Stéphanie

--
Stéphanie PIERSON
Université de la Méditerranée (Aix-Marseille II)
Master 2 Pro Bioinformatique et Génomique
Microarray limma oligo • 2.9k views
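As background to the "high corrected p-values" in the question: the Benjamini-Hochberg adjustment used by limma's topTable(..., adjust.method = "BH") scales each raw p-value by roughly m/rank, where m is the number of tests, which is also why prefiltering (reducing m) mechanically lowers the adjusted values. A minimal base-R illustration with made-up p-values:

p <- c(2e-4, 1e-3, 0.012, 0.03, 0.2, 0.7)   # hypothetical raw p-values, already sorted
p.adjust(p, method = "BH")
# by hand: multiply by m/rank, then take running minima from the largest p downwards
m <- length(p)
rev(cummin(rev(p * m / seq_len(m))))        # same values as p.adjust for this sorted input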
Yuan Hao ▴ 240
@yuan-hao-3658
Last seen 10.2 years ago
United States
Hi Stephanie,

You can have a look at the 'genefilter' package in R/Bioconductor. It is easy to set up an overall-variance filter. For example, if you have a data set normalized with gcrma and you require all probe sets to have an IQR greater than 0.5, you can do:

library(affy)
library(genefilter)
library(gcrma)
eset <- gcrma(data)
f <- function(x) (IQR(x) > 0.5)
selected <- genefilter(eset, f)
eset.filtered <- eset[selected, ]

You may have to be careful about filtering your data; it very much depends on the characteristics of your data. There is a paper [1] with a very good review of this, which does not really recommend combining an overall-variance filter with limma.

Cheers,
Yuan

[1] R. Bourgon, R. Gentleman and W. Huber. PNAS 2010, p. 9546-9551.
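For the Agilent two-color arrays in the original question, the same kind of IQR cutoff can be applied to the matrix of normalized log-ratios before fitting with limma. The sketch below is only an illustration: the targets file, the design matrix and the 0.5 cutoff are assumptions, not something given in the thread, and (per Bourgon et al.) a variance-type filter ahead of limma's moderated statistics is exactly the combination that paper advises caution about.

library(limma)
targets <- readTargets("targets.txt")              # hypothetical targets file listing the 8 arrays
RG <- read.maimages(targets, source = "agilent")   # Agilent Feature Extraction output
RG <- backgroundCorrect(RG, method = "normexp")
MA <- normalizeWithinArrays(RG, method = "loess")
MA <- normalizeBetweenArrays(MA, method = "Aquantile")
iqr <- apply(MA$M, 1, IQR, na.rm = TRUE)           # spread of each probe's log-ratios across arrays
MA.filtered <- MA[iqr > 0.5, ]                     # illustrative cutoff, mirroring the gcrma example
fit <- eBayes(lmFit(MA.filtered, design))          # 'design' must encode the time/condition layout

A filter that stays independent of the test statistic, such as average log-intensity (the A values) or Agilent's control-probe flags, is often a safer choice than a variance filter when limma is used downstream.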
I have seen many questions on this mailing list along the lines of "after I applied the multiple-testing correction, I got zero differentially expressed genes". The suggested solution is almost always gene prefiltering. I disagree with this old idea and have proposed a new one, EDR, which does not need gene prefiltering: http://www.ncbi.nlm.nih.gov/pubmed/20846437

However, the old idea is hard to shake, because it has been accepted for a long time in microarray analysis and now in RNA-seq as well, and a new idea needs time to be recognized. Here is an intuitive scenario: assume the raw p-values and the top lowest-p-value genes are the same before gene filtering (35k genes) and after (5k genes). The gene x you selected from 35k versus the one selected from 5k, which is more sound? In other words, the best student selected from 1000 students versus the best student selected from 100, which is more sound?

Wayne
Swapna Menon ▴ 50
@swapna-menon-4082
Last seen 10.2 years ago
Hi Stephanie,

There is another recent paper you might consider, which also cautions about filtering:

Van Iterson, M., Boer, J. M., & Menezes, R. X. (2010). Filtering, FDR and power. BMC Bioinformatics, 11(1), 450.

They also recommend their own statistical test for checking whether one's filter biases the FDR. Currently I am trying the variance filter and feature filter from the genefilter package; see ?nsFilter for help on these functions. However, I don't use filtering routinely, since choosing the right filter and parameters, and testing the effects of any bias, are things I have not worked out, in addition to having read Bourgon et al., van Iterson et al. and others that discuss this issue.

About your limma results: while conventional filtering may be expected to increase the number of significant genes, as these papers suggest, the likelihood of false positives also increases. In your current results, do you have fold changes above 2 (log2 > 1)? You may want to explore the biological relevance of the genes with high fold changes and significant unadjusted p-values.

Best,
Swapna
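The nsFilter approach mentioned above works on an annotated ExpressionSet (typically Affymetrix-style single-channel data). A minimal sketch, assuming an object 'eset' whose annotation slot points to an installed chip annotation package; the cutoffs shown are illustrative defaults, not recommendations:

library(genefilter)
filt <- nsFilter(eset,
                 require.entrez   = TRUE,   # drop probes without an Entrez Gene ID
                 remove.dupEntrez = TRUE,   # keep one probe per gene (the one with the largest IQR)
                 var.func         = IQR,
                 var.filter       = TRUE,
                 var.cutoff       = 0.5,    # with filterByQuantile = TRUE, this is the median IQR
                 filterByQuantile = TRUE)
eset.filtered <- filt$eset
filt$filter.log                             # how many features each step removed

For the Agilent two-color MAList in this thread there is no ExpressionSet or annotation package involved, so filtering the MAList directly (for example by average intensity, or by removing Agilent control probes via the ControlType column) is the closer fit.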
Hi Swapna,

On Jun 2, 2011, 7:58 PM, Swapna Menon wrote:
> About your limma results, while conventional filtering may be expected
> to increase the number of significant genes, as the papers suggest the
> likelihood of false positives also increases.

No. Properly applied filtering does not affect the false-positive rates (FWER or FDR). That's the whole point of it [1]. If one is willing to put up with a higher rate or probability of false discoveries, then don't do filtering - just increase the p-value cutoff.

[1] Bourgon et al., PNAS 2010.

Best wishes
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
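To make this point concrete, here is a toy null simulation (an illustration under assumed iid normal noise, not data from the thread), using the ordinary t-test rather than limma: a filter computed without looking at the group labels, such as overall variance, leaves the raw p-values of the retained genes approximately uniform, so BH control is not undermined. (The limma-specific caveat is different: a variance filter also changes the set of genes used to moderate the variances, which is the interaction Bourgon et al. advise caution about.)

set.seed(1)
n.genes <- 10000
x <- matrix(rnorm(n.genes * 8), nrow = n.genes)     # 8 arrays, every gene truly null
grp <- rep(c("A", "B"), each = 4)
pval <- apply(x, 1, function(y) t.test(y[grp == "A"], y[grp == "B"])$p.value)
v <- apply(x, 1, var)                               # label-blind overall variance
keep <- v > quantile(v, 0.75)                       # keep the 25% most variable genes
hist(pval[keep], breaks = 20)                       # still roughly uniform
sum(p.adjust(pval[keep], method = "BH") < 0.05)     # typically 0 false discoveries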
Hi, dear Wolfgang,

I think it would be nice to bring up a discussion here about the gene-prefiltering issue. Please point it out if this suggestion is inappropriate.

There are two questions about gene filtering for which I could not find answers:

1) In traditional multiple testing, where the p-values of many test groups are corrected (for example, in an experiment on a new drug's effect), is it appropriate to remove some group tests from the whole experiment? If not, why can we prefilter the genes?

2) As I stated in my previous email, assume the raw p-values and the top lowest-p-value genes are the same before (35k genes) and after gene filtering (5k genes). The gene x you selected from 35k versus the one selected from 5k, which is more sound? In other words, the best student selected from 1000 students versus the best student selected from 100, which is more sound?

So this is a question about the whole point of the gene-prefiltering approach.

Best wishes,
Wayne
Speaking as a pure 'biologist', I think it's OK to pre-filter genes as long as you know the pitfalls, in terms of the potential bias and the effect on FDRs. I am personally aware of people pre-filtering not only to improve the FDR, but also using the results of a t-test as the starting point for a second, sequential t-test because the FDRs from that second test are 'amazingly good'.

However statistically sacrilegious this is, the top 10 genes are always going to be the same top 10 genes, so if you are just looking for the top 10 genes, this is essentially OK.

How does that hang with you guys?

Matt

----------------------
Matthew Arno, Ph.D.
Genomics Centre Manager
King's College London
On Mon, Jun 13, 2011 at 6:55 AM, Arno, Matthew wrote:
> However statistically sacrilegious this is, the top 10 genes are always going
> to be the same top 10 genes, so if you are just looking for the top 10 genes,
> this is essentially OK.
>
> How does that hang with you guys?

Hi, Matt. See Keith Baggerly's work, or a review by Rich Simon (http://www.ncbi.nlm.nih.gov/pubmed/17227998), for some evidence of how important it is to stay on the straight and narrow and how easy it is to go astray. To be a bit clearer, in the words of Nike: "Just [don't] do it".

Sean
@wxumsiumnedu-1819
Last seen 10.2 years ago
Thanks, Matt, for joining this discussion.

It is true from a biologist's point of view: you always get the top 10 genes whether you filter or not. But this shifts to another question, the 'amazingly good FDR'. For the same top ten genes, people can report different FDRs by filtering or not filtering, or by filtering different numbers of genes. These FDRs in different reports are not comparable at all. Does this FDR make sense? People can try to make it amazingly good. Does that sound a little like 'cheating'? Sorry, I do not mean real cheating here.

Do you have any thoughts about this?

Best wishes,
Wayne
Wayne - I *definitely* mean cheating! It depends on whether the FDR is reported, I suppose. Let's say you do a microarray screen and the 'most changed' gene that comes up (either by largest fold change or smallest t-test/ANOVA p-value) is 'interesting', biologically speaking. You go on to validate the change (on the same samples and further test sets) using qPCR and/or western blots etc., if you go as far as protein analysis. Therefore you can analyse the importance of that single gene in a real biological context. No one could argue that the gene is not changed in the study and other samples, because of the low-throughput validation, and it makes a nice biological story for a paper. This is regardless of the arrays used, the test used, the FDR or even the actual p-value. You could have picked the gene by sticking a pin in a list; you just used an array to make that pin-stick more likely to give a real change.

However, the statistical factors definitely do matter when you are trying to report an overall analysis with lots of genes/patterns/pathways/functions etc., with a wide range of conclusions, perhaps in the absence of being able to perform a high-throughput validation of every gene (or a proportion of them) in the final 'significant' list. I can see it from both sides... however, sometimes it's easy to lose sight of the fact that an array hybridisation is just a hypothesis generator, not a hypothesis solver. That said, any attempt to standardise this sort of reporting must have parity and (importantly) transparency regarding all these factors to have any success.

I don't actually think there is a single valid answer to this issue, as there are so many interpretations/angles; it's just interesting to see how the debate changes over time. And essential to keep having it too!

Thanks for reading - I have lots of thoughts about this!

Matt

----------------------
Matthew Arno, Ph.D.
Genomics Centre Manager
King's College London
Dear Matt,

I read your email again. Since you have lots of thoughts about this issue, I guess you have probably also thought a lot about solutions. I hope my continued follow-up is not boring; please point it out if I am wrong in my words.

There is little question about an experimental result such as an RT-PCR validation of a differentially expressed gene. However, when we test many genes by microarray or RNA-seq, we do need something like the FDR to control how many genes we are going to report. Even though this FDR is not the "absolutely true" false discovery rate, it can work as a relative controller. The point is that when different people use the same FDR method, the reported FDRs should be comparable.

Usually people will not do gene prefiltering first; they do it only when they find the FDR is too high. If you report a gene list with a very high FDR, the reviewers will reject the paper. So people try to produce an amazingly good FDR by gene prefiltering: the same gene list that had a high FDR before prefiltering now has a lower FDR, and the reviewers are happy with it. It seems, in some cases, that "with this FDR method, we have to do gene prefiltering in order to get a good FDR". We can see two problems here: one is the FDR method itself, and the other is the gene-prefiltering approach.

Having thought a lot about these problems, I came up with a solution called EDR in which I have addressed them: http://www.ncbi.nlm.nih.gov/pubmed/20846437

Have you read this paper? Do you think it could be one of the standardised solutions? Any comments would be appreciated.

Best wishes,
Wayne

--
Wayne Xu, Ph.D.
Computational Genomics Specialist
Supercomputing Institute for Advanced Computational Research
University of Minnesota
ADD REPLY
0
Entering edit mode
Hi Matt,

Let me note that PCR (or even protein analysis) performed on the SAME samples does not solve the FDR problem. It will only confirm that the microarrays reported the correct expression levels (or fold changes). So now we are sure that in 3 samples under condition A the level of some gene is indeed higher than in 3 samples under condition B, but we still do not know whether this is a true phenomenon distinguishing conditions A and B or whether it just happened by chance, since we are testing thousands (or tens of thousands) of genes. You will need additional (independent) samples to confirm that it is a true phenomenon.

Moshe.
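To make the multiple-testing point concrete, here is a minimal R sketch (simulated data only, base R; the 20,000 genes and the 3-vs-3 design are made-up numbers chosen to resemble the kind of experiment discussed in this thread). Even when no gene is truly differential, the best-ranked gene looks convincing, and re-measuring it on the same samples would simply reproduce the same difference:

set.seed(1)
n.genes   <- 20000                                # hypothetical number of probes
n.per.grp <- 3                                    # hypothetical replicates per condition
grp <- factor(rep(c("A", "B"), each = n.per.grp))

# expression matrix with NO true differences between the two conditions
x <- matrix(rnorm(n.genes * 2 * n.per.grp), nrow = n.genes)

# one two-sample t-test per gene
pvals <- apply(x, 1, function(y) t.test(y ~ grp)$p.value)

min(pvals)                                  # the top gene has a tiny raw p-value purely by chance
sum(p.adjust(pvals, method = "BH") < 0.05)  # almost always 0: FDR control is doing its job

Confirming that top gene again on the same three-versus-three samples cannot tell this situation apart from a real effect; only independent samples can.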
ADD REPLY
0
Entering edit mode
Not necessarily. PCR has a wider dynamic range and greater precision than microarrays. The improved precision _may_ mean that we have substantially more evidence for differential expression based on the PCR results (even on the same samples) than we did just from the analysis of the microarray data.

I do, however, agree that additional independent samples are the best solution (and am amazed to have found myself writing the first paragraph of this response...)

Kevin
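As a rough illustration of the precision point (entirely made-up numbers on a log2-like scale, not real array or qPCR measurements): with only three samples per condition, the same shift of about one unit gives far stronger evidence when it is measured with less technical noise.

# noisier, array-like values: a ~1-unit difference, but scattered within each group
arr  <- t.test(c(5.1, 4.6, 5.6), c(6.3, 5.7, 6.6))

# the same ~1-unit shift measured with much tighter, qPCR-like precision
qpcr <- t.test(c(5.05, 4.98, 5.02), c(6.02, 5.95, 6.06))

c(array_p = arr$p.value, qpcr_p = qpcr$p.value)  # the second p-value is orders of magnitude smaller

The extra evidence applies only to that one gene on those particular samples; the independent-samples caveat above still stands.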
ADD REPLY
0
Entering edit mode
Hi Kevin,

Thank you for your explanation.

Moshe.

P.S. How about RT qPCR where one has a plate with 384 wells - a step towards high throughput PCR - is it as accurate as traditional PCR?
ADD REPLY
0
Entering edit mode
I agree entirely - with all these points. They are all perfectly valid in defining an ideal or minimum standard for publication of microarray analyses in their entirety. Indeed, that's what this is all about. I am not trying to convert anyone, or to propose lowering standards.

My angle is just that sometimes either the study design or the data itself is not up to these standards (and is not intended to be). This is no bar to extracting some useful biological information from the data, as long as the limitations are fully recognised, i.e. you don't try to publish gene lists from a single-replicate array study in Cell, prefiltered to remove genes 'below 50' and with no multiple testing correction.

In running a core facility and analysing data from studies where money is tight (and getting tighter), I find myself in the position where there is no chance that the data can meet the standards talked about here, but short of saying "go away, spend another £20,000 and don't waste my time" I usually try to glean *some* kind of hypothesis from it. Results vary from interesting biological patterns worthy of further work, perhaps publication, to nothing significant at all.

Matt

----------------------
Matthew Arno, Ph.D.
Genomics Centre Manager
King's College London
ADD REPLY
0
Entering edit mode
@wxumsiumnedu-1819
Last seen 10.2 years ago
Hi Matt,

I, and all of us, have done it that way for a long time. Glad to see I am not the only one arguing for this approach. Agreed - let's see how the debate changes over time.

Thanks,

Wayne

--
> Wayne - I *definitely* mean cheating! It depends on whether the FDR is reported, I suppose. Let's say you do a microarray screen and the 'most changed' gene that comes up (either by largest fold change or smallest t-test/ANOVA p-value) is 'interesting' biologically speaking. You go on to validate the change (on the same samples and further test sets) using qPCR and/or western blots etc., if you go as far as protein analysis. Therefore you can analyse the importance of that single gene in a real biological context. No one could argue that the gene is not changed in the study and other samples, because of the low-throughput validation, and it makes a nice biological story for a paper. This is regardless of the arrays used, the test used, the FDR or actual p-value even. You could have picked the gene by sticking a pin in a list; you just used an array to make that pin stick more likely to give a real change.
>
> However, the statistical factors do definitely matter when you are trying to report an overall analysis with lots of genes/patterns/pathways/functions etc., with a wide range of conclusions, perhaps in the absence of being able to perform a high-throughput validation of every gene (or a proportion of them) in the final 'significant' list. I can see it from both sides...however, sometimes it's easy to lose sight that an array hybridisation is just a hypothesis generator, not a hypothesis solver. That said, any attempt to standardise this sort of reporting must have parity and (importantly) transparency with all these factors to have any success.
>
> I don't actually think there is a single valid answer to this issue, as there are so many interpretations/angles; it's just interesting to see how the debate changes over time. And essential to keep having it too!
>
> Thanks for reading - I have lots of thoughts about this!
> Matt
>> -----Original Message-----
>> From: wxu at msi.umn.edu
>> Sent: 13 June 2011 14:14
>> To: Arno, Matthew
>> Cc: bioconductor at r-project.org
>> Subject: Re: [BioC] PreFiltering probe in microarray analysis
>>
>> Thanks, Matt, for joining this discussion.
>>
>> That is true from the biologist's point of view: you always get the same
>> top 10 genes whether you filter or not. But this shifts to another
>> question, the 'amazingly good FDR'. For the same top ten genes, people can
>> report different FDRs by filtering or not filtering, or by filtering out
>> different numbers of genes. These FDRs are not comparable across reports
>> at all. Does such an FDR make sense? People can try to make it look
>> amazingly good. Does that sound a little like 'cheating'? (Sorry, I do not
>> mean real cheating here.)
>>
>> Do you have any thoughts about this?
>>
>> Best wishes,
>>
>> Wayne
>>
>>> Speaking as a pure 'biologist', I think it's OK to pre-filter genes as
>>> long as you know the pitfalls, in terms of the potential bias and effect
>>> on FDRs. I am personally aware of people pre-filtering not only to
>>> enhance the FDR, but to use the results of a t-test as the starting point
>>> for a second, sequential t-test because the FDRs from that test are
>>> 'amazingly good'.
>>>
>>> However statistically sacrilegious this is, the top 10 genes are always
>>> going to be the same top 10 genes, so if you are just looking for the top
>>> 10 genes, this is essentially OK.
>>>
>>> How does that hang with you guys?
>>>
>>> Matt
>>>
>>>> -----Original Message-----
>>>> From: bioconductor-bounces at r-project.org On Behalf Of wxu at msi.umn.edu
>>>> Sent: 12 June 2011 16:41
>>>> To: Wolfgang Huber
>>>> Cc: bioconductor at r-project.org
>>>> Subject: Re: [BioC] PreFiltering probe in microarray analysis
>>>>
>>>> Hi, dear Wolfgang,
>>>>
>>>> I think it would be nice to open a discussion here about the gene
>>>> prefiltering issue. Please tell me if this suggestion is inappropriate.
>>>>
>>>> There are two questions about gene filtering for which I could not find
>>>> answers:
>>>> 1) In traditional multiple testing, where the p-values of many test
>>>> groups are corrected (for example, in a new-drug experiment), is it
>>>> appropriate to remove some of the group tests from the whole experiment?
>>>> If not, why may we prefilter the genes?
>>>> 2) As I stated in my previous email, assume that the raw p-values and
>>>> the top lowest-p-value genes are the same before (35k genes) and after
>>>> gene filtering (5k genes). The gene x selected from 35k versus the one
>>>> selected from 5k: which is more sound? In other words, the best student
>>>> selected from 1000 students versus the best student selected from 100:
>>>> which is more sound?
>>>>
>>>> So this is a question about the whole point of the gene prefiltering
>>>> approach.
>>>>
>>>> Best wishes,
>>>>
>>>> Wayne
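To illustrate Wayne's point that FDRs reported after filtering different numbers of
genes are not directly comparable, here is a toy calculation with simulated p-values
(nothing from the poster's dataset): the same raw p-values receive different
Benjamini-Hochberg adjustments simply because the number of tests entering the
correction changes.

    set.seed(1)
    p.small <- runif(10, min = 1e-6, max = 1e-4)   # ten genes with small raw p-values
    p.all   <- c(p.small, runif(34990))            # the same ten genes among ~35,000
    p.filt  <- c(p.small, runif(4990))             # the same ten genes among 5,000 "kept"

    ## Same raw p-values, different BH-adjusted values:
    ## only the total number of tests changed.
    cbind(raw     = p.small,
          FDR.35k = p.adjust(p.all,  method = "BH")[1:10],
          FDR.5k  = p.adjust(p.filt, method = "BH")[1:10])

This is, in essence, the 'best student from 1000 versus best student from 100'
question: the ranking does not change, but the multiplicity penalty attached to it does.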
>>>>> Hi Swapna,
>>>>>
>>>>> On Jun/2/11 7:58 PM, Swapna Menon wrote:
>>>>>> Hi Stephanie,
>>>>>> There is another recent paper that you might consider, which also
>>>>>> cautions about filtering:
>>>>>> Van Iterson, M., Boer, J. M., & Menezes, R. X. (2010). Filtering, FDR
>>>>>> and power. BMC Bioinformatics, 11(1), 450.
>>>>>> They also propose their own statistical test to check whether one's
>>>>>> filter biases the FDR.
>>>>>> Currently I am trying the variance filter and feature filter from the
>>>>>> genefilter package: try ?nsFilter for help on these functions.
>>>>>> However, I don't use filtering routinely, since choosing the right
>>>>>> filter and parameters and testing the effects of any bias are things I
>>>>>> have not worked out, in addition to having read Bourgon et al.,
>>>>>> Van Iterson et al. and others that discuss this issue.
>>>>>> About your limma results: while conventional filtering may be expected
>>>>>> to increase the number of significant genes, as the papers suggest,
>>>>>> the likelihood of false positives also increases.
>>>>>
>>>>> No. Properly applied filtering does not affect the false positive rates
>>>>> (FWER or FDR). That is the whole point of it. [1]
>>>>>
>>>>> If one is willing to put up with a higher rate or probability of false
>>>>> discoveries, then don't do filtering - just increase the p-value cutoff.
>>>>>
>>>>> [1] Bourgon et al., PNAS 2010.
>>>>>
>>>>>> In your current results, do you have high fold changes, above 2
>>>>>> (log2 > 1)? You may want to explore the biological relevance of those
>>>>>> genes with high fold change and a significant unadjusted p-value.
>>>>>> Best,
>>>>>> Swapna
>>>>>
>>>>> Best wishes
>>>>> Wolfgang Huber
>>>>> EMBL
>>>>> http://www.embl.de/research/units/genome_biology/huber
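For reference, a minimal sketch of what 'properly applied' independent filtering can
look like in code, in the spirit of Bourgon et al.: filter on an overall-variability
statistic that ignores the sample labels, then fit limma and apply BH only to the
probes that survive. 'eset' and 'design' are placeholders for the poster's normalized
expression set and design matrix, which are not shown in the thread, and the IQR
cutoff is likewise only an example.

    library(genefilter)
    library(limma)

    ## Independent filter: drop the 50% of probes with the lowest IQR across
    ## all arrays (the filter statistic never looks at the group labels).
    eset.f <- varFilter(eset, var.func = IQR, var.cutoff = 0.5,
                        filterByQuantile = TRUE)

    fit <- lmFit(eset.f, design)   # linear model per remaining probe
    fit <- eBayes(fit)             # moderated t-statistics
    topTable(fit, coef = 2, adjust.method = "BH", number = 20)  # BH over the filtered set only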
ADD COMMENT
