Non-specific filtering for HyperGeometric/GSEA test


Yuan Hao ▴ 30


Dear list,

May I ask a question about the non-specific filtering used to define a gene universe for a HyperGeometric/GSEA test?

I have fifteen samples from Affymetrix. To remove probe sets that show little variation across samples, I evaluated the IQR of each probe set across samples with either of the following two pieces of code:

    # code one
    > cutoff <- 0.5
    > Iqr <- apply(exprs(eset), 1, IQR)
    > selected <- (Iqr > cutoff)
    > filtered <- eset[selected, ]
    > dim(filtered)
    Features  Samples
       11490       15

    # code two
    > library(genefilter)
    > filtered <- varFilter(eset, var.func=IQR, var.cutoff=0.5, filterByQuantile=TRUE)
    > dim(filtered)
    Features  Samples
       27337       15

I realized that the differences in "filtered" given by the two methods above may come from different definitions of IQR. In the first case, IQR was computed by using the 'quantile' function rather than Tukey's format, "IQR(x) = quantile(x,3/4) - quantile(x,1/4)", which was used in the second case. I am aware that the number of genes in the gene universe has a significant effect on the test result. However, I am not sure which IQR evaluation method is the better choice for the HyperGeometric/GSEA test. It would be much appreciated if you could shed some light on this!

Regards,
Yuan


Yuan Hao ▴ 30


Dear Wolfgang,

You are absolutely right: I got the same result when I tried your line of code. I forgot to mention in my last email that I had already noticed the quantile interpretation in "varFilter". I tried turning off the 'filterByQuantile' argument of "varFilter" (code three below), but still got quite different results compared to code one, which confused me. Your explanation of "rowIQRs" addresses exactly that confusion and resolves the question. Thank you very much again!

    # code three
    > filtered <- varFilter(eset, var.func=IQR, var.cutoff=0.5, filterByQuantile=FALSE)
    > dim(filtered)
    Features  Samples
       18634       15

Best wishes,
Yuan

(In reply to Wolfgang Huber's message of 11 May 2010, 23:50.)
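One possible reading of why code three and code one disagree, even though both apply 0.5 as an absolute cutoff, is that the two IQR estimates differ per probe set. The sketch below is an illustration in base R, not the genefilter implementation: the helper `row_iqr_orderstat` is hypothetical and assumes that a fast row-wise IQR takes the difference of two order statistics, at positions floor(0.25 n) and ceiling(0.75 n), instead of interpolating as `stats::IQR` does. The simulated matrix stands in for `exprs(eset)`.

```r
# Hypothetical helper (an assumption, not taken from genefilter):
# IQR per row as the difference of two order statistics.
row_iqr_orderstat <- function(m) {
  n <- ncol(m)
  stopifnot(n >= 4)            # floor(0.25 * n) must be at least 1
  lo <- floor(0.25 * n)        # position of the lower order statistic
  hi <- ceiling(0.75 * n)      # position of the upper order statistic
  apply(m, 1, function(x) {
    s <- sort(x)
    s[hi] - s[lo]
  })
}

# Simulated stand-in for exprs(eset): 200 probe sets x 15 samples.
set.seed(2)
expr <- matrix(rnorm(200 * 15), nrow = 200, ncol = 15)

iqr_interp <- apply(expr, 1, IQR)        # interpolated quantiles (default type 7)
iqr_order  <- row_iqr_orderstat(expr)    # order-statistic version

# With 15 samples the order-statistic version spans the 3rd to the 12th
# sorted value, so it can never be smaller than the interpolated IQR.
all(iqr_order >= iqr_interp)

# The same absolute cutoff therefore keeps at least as many rows.
cutoff <- 0.5
n_interp <- sum(iqr_interp > cutoff)
n_order  <- sum(iqr_order > cutoff)
c(interpolated = n_interp, order_statistic = n_order)
```

Under this assumption a wider IQR estimate lets more probe sets pass a fixed cutoff, which is the direction of the difference reported between code one (11490 features) and code three (18634 features).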


Wolfgang Huber ★ 13k


Dear Yuan,

Have a look at the manual page of "varFilter", which indicates that its 'var.cutoff' argument is interpreted as the quantile of the overall distribution of variances to be used as the cutoff, whereas in your "code one" the "cutoff" is interpreted as the actual variance value to be used as the cutoff. Try

    selected <- (Iqr > quantile(Iqr, probs=cutoff))

The result of this should be nearly the same as with "code two".

Why only "nearly"? You are right that "varFilter" does something odd when "var.func = IQR": it calls "rowIQRs", which runs a little faster but produces a different result. You can verify this by typing "varFilter" and reading its code. (One might argue that the effort of understanding what this function does exceeds the effort of doing it from scratch...)

So both code versions should produce nearly identical results, and the results of the downstream analysis (GSEA) should not depend sensitively on this.

Best wishes
Wolfgang

(In reply to Yuan Hao's message of 11/05/10 01:41.)

--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
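The two interpretations of the cutoff described in this reply can be compared with a minimal base R sketch. The simulated matrix `expr` is an assumption standing in for `exprs(eset)`, and the variable names are illustrative only; no genefilter function is called.

```r
# Simulated stand-in for exprs(eset): 1000 probe sets x 15 samples,
# each row drawn with its own standard deviation.
set.seed(1)
n_probes <- 1000
expr <- matrix(rnorm(n_probes * 15, sd = runif(n_probes, 0.1, 1)),
               nrow = n_probes, ncol = 15)

cutoff <- 0.5
iqr <- apply(expr, 1, IQR)

# Interpretation used in "code one": cutoff is an absolute IQR value.
keep_absolute <- iqr > cutoff

# Interpretation suggested in the reply: cutoff is a quantile of the
# distribution of IQRs, so 0.5 means "above the median IQR".
keep_quantile <- iqr > quantile(iqr, probs = cutoff)

# Number of probe sets retained under each interpretation.
c(absolute = sum(keep_absolute), quantile = sum(keep_quantile))
```

With a quantile cutoff of 0.5 the filter keeps the upper half of the probe sets whatever the scale of the data, whereas the number kept by an absolute cutoff depends entirely on how large the IQRs are.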
