Limma : post statistical gene filtering


Stephanie PIERSON ▴ 30

@stephanie-pierson-4671

Last seen 10.2 years ago

Dear Bioconductor listers,

I am analyzing Agilent two-color microarray data. I chose the limma package for normalization and statistical analysis because I have only 2 replicates per condition, and I have read that a moderated t-test performs better when there are few replicates.

The problem is that when I ran the statistical test on the whole data set (35,000 probes), I found no differential expression: all the adjusted p-values lie between 0.5 and 0.9. I have seen on the list that the question of prefiltering genes has already been asked: some people recommend doing the normalization, model fitting, etc., and then filtering before the multiplicity adjustment. So, after the statistical analysis, I removed genes with log2FC < 2 (ebayes$coefficients) and then applied the FDR correction. But once again, I have no adjusted p-value < 0.05.

So I was wondering which criteria I could use to filter out genes before the multiple testing correction: p-value? log2FC? Other criteria?

I have a lot of variability between replicates: for many genes, the fold change is < 0 in one replicate (for example, -5) and > 0 in the other (for example, 3). Do you think I should remove those genes before the statistical analysis, or can I keep them?

Thank you,
Best wishes,
Stéphanie

--
Stéphanie PIERSON
Université de la Méditerranée (Aix-Marseille II)
Master 2 Pro Bioinformatique et Génomique

Microarray Normalization • 2.1k views

updated 13.5 years ago by Kevin Coombes ▴ 430 • written 13.5 years ago by Stephanie PIERSON ▴ 30


Kevin Coombes ▴ 430

@kevin-coombes-3935

Last seen 2.1 years ago

United States

If you filter the genes after performing the t-test, then I will not believe the results. Filtering on any criterion that knows how much the genes differ between the two groups being contrasted (fold change, p-value, etc.) is statistically and scientifically invalid.

People have made (and continue to make) the argument that it is safe/sound/reasonable to filter on criteria that do not rely on the results of the statistical test. Examples of such filters are ones that look at the mean (or max, or some percentile) of the gene expression across the entire data set, or at the variance or range across the entire data set.

If you have no differential expression, then filtering is not going to magically create it for you. I would advise one of the following options:

[1] Rank the genes by p-value or t-statistic (possibly filtered by fold change) and perform PCR on the top ten to see if any of them can actually be confirmed.

[2] Run more arrays so you have enough replicates to provide adequate power to discover smaller differences in expression than you can expect to find with only two replicates per group.

Kevin
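To see why filtering on fold change before the multiplicity adjustment is invalid, here is a minimal Python sketch. It is not limma: a simplified z-test with a known standard deviation of 1 is assumed in place of the moderated t-test, and all function names are made up for illustration. It simulates genes with no true difference between two groups of 2 replicates, then applies a Benjamini-Hochberg adjustment with and without a |log2FC| >= 2 filter.

```python
import math
import random


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [1.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def simulate_null_genes(n_genes=5000, seed=1):
    """Return (log_fc, p_value) pairs for genes with NO true group difference."""
    rng = random.Random(seed)
    genes = []
    for _ in range(n_genes):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(2)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(2)]
        log_fc = sum(group_a) / 2 - sum(group_b) / 2
        # Assumption: sd is known to be 1, so with 2 replicates per group
        # the standard error of log_fc is exactly 1 and z equals log_fc.
        z = log_fc / 1.0
        p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
        genes.append((log_fc, p_value))
    return genes


def count_significant(pvalues, alpha=0.05):
    """Number of genes whose BH-adjusted p-value is below alpha."""
    return sum(adj < alpha for adj in bh_adjust(pvalues))


genes = simulate_null_genes()
all_p = [p for _, p in genes]
# Invalid filter: it uses the very contrast that is being tested.
kept_p = [p for fc, p in genes if abs(fc) >= 2]

print("significant without filtering:", count_significant(all_p))
print("genes kept by |log2FC| >= 2:", len(kept_p))
print("significant after filtering:", count_significant(kept_p))
```

Under these assumptions every gene that survives the fold-change filter has a raw p-value below about 0.046, so an adjustment computed on the survivors alone declares all of them significant even though none is truly differentially expressed.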

13.5 years ago • Kevin Coombes ▴ 430


On Thu, Jun 16, 2011 at 11:23 AM, Kevin R. Coombes wrote:

> [1] Rank the genes by the p-value or t-statistic (possibly filtered by fold change) and perform PCR on the top ten to see if any of them can actually be confirmed.
> [2] Run more arrays so you have enough replicates to provide adequate power to discover smaller differences in expression than you can expect to find with only two replicates per group.

Kevin and I agree on these points. I would add a third option, which is to use GSEA-like ideas or gene set tests to look for a signal of differential expression in larger gene sets.

Sean
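One way to picture the gene set idea, as a rough Python sketch rather than any particular package's method: compare the mean absolute statistic of a predefined gene set with that of randomly drawn sets of the same size. The function name, the choice of the mean absolute statistic as the summary, and the toy data are all assumptions made for illustration.

```python
import random


def gene_set_p_value(stats, gene_set, n_draws=2000, seed=1):
    """Random-set test: is the mean |statistic| in gene_set unusually large?

    stats maps gene id -> per-gene statistic (for example a t-statistic).
    gene_set is a collection of gene ids defined BEFORE looking at the data.
    """
    rng = random.Random(seed)
    genes = list(stats)
    members = [g for g in genes if g in gene_set]
    if not members:
        raise ValueError("gene_set shares no genes with stats")
    observed = sum(abs(stats[g]) for g in members) / len(members)
    hits = 0
    for _ in range(n_draws):
        # Draw a random set of the same size from all measured genes.
        draw = rng.sample(genes, len(members))
        if sum(abs(stats[g]) for g in draw) / len(draw) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_draws + 1)


# Toy data: 1000 genes with noise-only statistics, 20 of them shifted modestly.
toy_rng = random.Random(42)
toy_stats = {f"gene{i}": toy_rng.gauss(0.0, 1.0) for i in range(1000)}
pathway = {f"gene{i}" for i in range(20)}
for gene in pathway:
    toy_stats[gene] += 1.0  # too small a shift to stand out gene by gene

print("gene set p-value:", gene_set_p_value(toy_stats, pathway))
```

The point of the design is that a coordinated but modest shift across many genes can be detectable as a set even when no single gene survives a genome-wide multiplicity adjustment.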

13.5 years ago • Sean Davis 21k


Sean Davis 21k

@sean-davis-490

Last seen 3 months ago

United States

Hi, Stephanie.

On Thu, Jun 16, 2011 at 10:17 AM, Stephanie PIERSON wrote:

> So, after the statistical analysis, I removed genes with log2FC < 2 (ebayes$coefficients) and then applied the FDR correction. But once again, I have no adjusted p-value < 0.05.

Just a note on filtering. You should not filter on any measure that is derived from knowledge of the groupings. In this case, filtering based on ebayes$coefficients is not valid and will result in incorrect (and falsely significant) p-values.

As for your data, assuming that your limma analysis is correct, it sounds as if you have no evidence of differential expression. Perhaps your study would benefit from a larger number of samples to improve power?

> So I was wondering which criteria I could use to filter out genes before the multiple testing correction: p-value? log2FC? Other criteria?

There are papers on the subject and several email exchanges on this list (which is searchable), but filtering based on variance across ALL samples (not within groups) is a common technique. The goal is to keep not the lowest-variance genes but the top X% of genes with the highest variance (where X could be roughly 40-60%).

> I have a lot of variability between replicates: for many genes, the fold change is < 0 in one replicate (for example, -5) and > 0 in the other (for example, 3). Do you think I should remove those genes before the statistical analysis, or can I keep them?

Again, removing genes based on their variance within groups is not valid and will result in biased p-values (which are, therefore, not to be trusted). If you have high biological variation, you could certainly benefit from a larger sample size.

Sean
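The variance filter described above can be sketched in a few lines of plain Python. This is an illustration only, not the API of any Bioconductor package; the function names, the dictionary layout, and the toy values are assumptions. The key property is that the spread is computed across all arrays at once and the group labels never enter the calculation.

```python
import statistics


def iqr(values):
    """Interquartile range with linear interpolation between data points."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def top_variable_genes(expr, keep_fraction=0.5, spread=statistics.variance):
    """Keep the keep_fraction of genes with the largest spread across ALL samples.

    expr maps gene id -> list of log-ratios, one per array.
    Group membership is deliberately not an argument of this function.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(expr, key=lambda gene: spread(expr[gene]), reverse=True)
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Toy log-ratios for 4 genes on 4 arrays (hypothetical values).
toy_expr = {
    "geneA": [0.0, 0.0, 0.0, 0.1],
    "geneB": [-5.0, 3.0, -4.0, 4.0],
    "geneC": [1.0, 1.2, 0.9, 1.1],
    "geneD": [-2.0, 2.0, -1.0, 1.0],
}

print(top_variable_genes(toy_expr, keep_fraction=0.5))
print(top_variable_genes(toy_expr, keep_fraction=0.5, spread=iqr))
```

Passing `spread=iqr` swaps the variance for the interquartile range, which is less sensitive to a single outlying array. Either way, the filter has to be applied before the model is fitted or, at the latest, before the p-values are adjusted, and the cutoff should be fixed without looking at the test results.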

13.5 years ago • Sean Davis 21k


Stephanie PIERSON ▴ 30

@stephanie-pierson-4671

Last seen 10.2 years ago

Hi, Sean,

Thanks for your quick reply!

When you say "variance across ALL samples", do you mean filtering genes according to IQR? Is there a function that lets me "pick the top X% of the genes with the highest variance"?

Unfortunately, I am a student in a lab and it is impossible to enlarge the sample size...

Stéphanie

13.5 years ago • Stephanie PIERSON ▴ 30


2011/6/16 Stéphanie:

> When you say "variance across ALL samples", do you mean filtering genes according to IQR? Is there a function that lets me "pick the top X% of the genes with the highest variance"?

The genefilter package is a good place to look for gene filtering.

> Unfortunately, I am a student in a lab and it is impossible to enlarge the sample size...

That is really too bad, but all too common.

Sean

13.5 years ago • Sean Davis 21k
