Hi Stephanie,
There is another recent paper that you might consider which also
cautions about filtering
Van Iterson, M., Boer, J. M., & Menezes, R. X. (2010). Filtering, FDR
and power. BMC Bioinformatics, 11(1), 450.
They also recommend their own statistical test to see if one's filter
biases FDR.
currently I am trying variance filter and feature filter from
genefilter package: try ?nsFilter for help on these functions.
However, I dont use filtering routinely since choosing the right
filter , parameters and testing the effects of any bias are things I
have not worked out in addition to having read Bourgon et al and
Iterson et al and others that discuss this issue.
About your limma results, while conventional filtering may be expected
to increase the number of significant genes, as the papers suggest
likelihood of false positives also increases.In your current results,
do you have high fold changes above 2 (log2>1)? You may want to
explore the biological relevance of those genes with high FC and
significant unadjusted p value.
Best,
Swapna
>
> Message: 2
> Date: Wed, 01 Jun 2011 14:58:47 +0200
> From: Stephanie PIERSON <stephanie.pierson at="" etumel.univmed.fr="">
> To: bioconductor at r-project.org
> Subject: [BioC] PreFiltering probe in microarray analysis
> Message-ID: <20110601145847.72402aos7e2ne074 at
wmeletud2.univmed.fr>
> Content-Type: text/plain; charset=ISO-8859-15; DelSp="Yes";
> ? ? ? ?format="flowed"
>
> Hello everybody,
>
> I am a french student in bioinformatic. I have to analyze microarray
> data and I have some questions about prefiltering genes.
> The dataset that I have to analyze consist in 8 microarray, i have 4
> times points and 2 replicats for each time point. Agilent's two
color
> microarray ?(Whole Mouse Genome (4x44K) Oligo Microarrays) were used
> for the analysis. We are searching for genes that are differentially
> expressed between two conditions (for example C1 and C2) at the
> different time points and genes that are differentially expressed in
> one condition (C1 or C2) over time .
> I have chosen LIMMA to perform the statistical analysis because I
read
> in papers (Jeanmougin et al. PLoS ONE, Jefferey and al. BMC
> bioinformatic 2006,7/359 ?) that it work better in experiment with
few
> replicate per conditions.
> I perfom the statistical analysis on the whole data set ( more than
37
> 000 genes ), but I have high corrected p value after multiple
testing
> correction (benjamini hochberg ). I would like to prefilter genes
> before statistical analysis, but I don't know how to do this. I read
> in Bourgon's paper that we can filter on the overall variance or on
> the overall mean, but in my case, with few replicates, how can I do
?
> In more, in this paper, it is not recommended to combine limma with
a
> filtering procedure ...
> Someone can help me please ?
>
> Thank you,
> Best wishes
> St?phanie
>
>
>
> --
> St?phanie PIERSON
> Universite de la Mediterranee (Aix-Marseille II)
> Master 2 Pro Bioinformatique et G?nomique
>
>
>
> ------------------------------
>
> Message: 3
> Date: Wed, 1 Jun 2011 14:31:48 +0100
> From: Yuan Hao <yuan.x.hao at="" gmail.com="">
> To: Stephanie PIERSON <stephanie.pierson at="" etumel.univmed.fr="">
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] PreFiltering probe in microarray analysis
> Message-ID: <e177da51-396f-4bd3-a2c9-818c6fc2c76f at="" gmail.com="">
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed;
delsp=yes
>
> Hi Stephanie,
>
> You can have a look the 'genefilter' package in R/bioconductor.
> Basically, it's easy to set up a overall variance filter, for
example
> if you have a data set normalized by gcrma and you require all
> probesets having an IQR bigger than 0.5, you can do:
>
> ?> library(affy)
> ?> library(genefilter)
> ?> library(gcrma)
> ?> eset <- gcrma(data)
> ?> f <- function(x)(IQR(x)>0.5)
> ?> selected <- genefilter (eset, f)
> ?> eset.filtered <- eset[selected, ]
>
> You may have to be careful about the filtering on your data. It
quiet
> depends on the characters of your data. There is a paper[1] having
had
> a very good review about this, which doesn't really recommend an
> overall variance filtering combined with Limma.
>
> Cheers,
> Yuan
>
> [1] R. Bourgon, R. Gentleman and W. Huber. PNAS 2010. p9546-9551
>
> On 1 Jun 2011, at 13:58, Stephanie PIERSON wrote:
>
>> Hello everybody,
>>
>> I am a french student in bioinformatic. I have to analyze
microarray
>> data and I have some questions about prefiltering genes.
>> The dataset that I have to analyze consist in 8 microarray, i have
4
>> times points and 2 replicats for each time point. Agilent's two
>> color microarray ?(Whole Mouse Genome (4x44K) Oligo Microarrays)
>> were used for the analysis. We are searching for genes that are
>> differentially expressed between two conditions (for example C1 and
>> C2) at the different time points and genes that are differentially
>> expressed in one condition (C1 or C2) over time .
>> I have chosen LIMMA to perform the statistical analysis because I
>> read in papers (Jeanmougin et al. PLoS ONE, Jefferey and al. BMC
>> bioinformatic 2006,7/359 ?) that it work better in experiment with
>> few replicate per conditions.
>> I perfom the statistical analysis on the whole data set ( more than
>> 37 000 genes ), but I have high corrected p value after multiple
>> testing correction (benjamini hochberg ). I would like to prefilter
>> genes before statistical analysis, but I don't know how to do this.
>> I read in Bourgon's paper that we can filter on the overall
variance
>> or on the overall mean, but in my case, with few replicates, how
can
>> I do ? In more, in this paper, it is not recommended to combine
>> limma with a filtering procedure ...
>> Someone can help me please ?
>>
>> Thank you,
>> Best wishes
>> St?phanie
>>
>>
>>
>> --
>> St?phanie PIERSON
>> Universite de la Mediterranee (Aix-Marseille II)
>> Master 2 Pro Bioinformatique et G?nomique
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>>
https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
> ------------------------------
>
> Message: 4
> Date: Fri, 27 May 2011 21:59:41 +0000
> From: Ilya Lipkovich <ilya.lipkovich at="" gmail.com="">
> To: <bioconductor at="" stat.math.ethz.ch="">
> Subject: Re: [BioC] R package modreg
> Message-ID: <loom.20110527t235533-145 at="" post.gmane.org="">
> Content-Type: text/plain; charset="us-ascii"
>
> seems modreg has been merged into stats which is included in R
>
>
>
> ------------------------------
>
> Message: 5
> Date: Mon, 30 May 2011 10:17:54 -0700 (PDT)
> From: Les Dethlefsen <dethlefs at="" stanford.edu="">
> To: bioconductor at r-project.org
> Subject: [BioC] question/issue with multtest mt.maxT
> Message-ID:
> ? ? ? ?<570911820.845739.1306775874471.JavaMail.root at
zm08.stanford.edu>
> Content-Type: text/plain; charset=utf-8
>
> Hi All,
>
> I'm a biologist, not a statistician, and a bit of a newbie with R,
so please forgive me if this is a stupid question, and thanks for your
help.
>
> My data are from shotgun metagenomic sequencing of multiple samples,
but it's essentially similar to microarray data. ?I have a matrix of
3312 functional gene categories by 24 samples, with real-valued
entries indicating the proportion of interpretable gene sequence data
for each sample that is assigned to each functional category. ?There
are a number of ways to bin the samples into categories, based on some
other analysis I'm most interested in a breakdown into two subjects
and 5 temporal categories within each subject, i.e. a total of 10
categories. ?Two of the categories have 4 samples each, the remaining
8 categories have 2 samples each.
>
> When I ask mt.maxT to carry out a 2-sided F test on the 10
categories, I see something I don't understand: the adjusted p values
are not a monotonic transformation of the raw p values. ?They are a
monotonic with respect to the test statistic...as the test statistic
falls, the adjusted p values rise, as expected. ?But the raw p values
are bouncing around quite a bit, with the lowest raw p values of
0.0001 having a range of adjusted p values, all much larger than the
minimum adjusted p value. ?There are 244 raw p values lower than that
which corresponds to the minimum adjusted p value.
>
> My (perhaps mistaken) understanding was that permutations were used
to generate the raw p values (which I definitely want, since I don't
think the distributions of values for the different functional gene
categories will follow any regular distribution or be similar to each
other), and then some sort of adjustment was made to the raw values to
adjust for multiple testing considerations, but the adjustment itself
was not based on permutations. ?I would expect the permutations to be
fairly noisy in my case with only 2 samples in most of my sample
categories, but wouldn't that only affect the reliability of the raw p
values? ?How can the adjusted p values have a different rank order
than the raw p values?
>
> I did run the same command with the samples binned only to subject,
i.e. only 2 categories of 12 subject each. While the range of p values
is much lower (not surprisingly) there is still the same pattern with
a monotonic relationship between the test statistic and the adjusted p
values, but not between the raw p values and adjusted p values.
?Hence, it doesn't seem that this behavior is due to the messiness of
10 unequally sized categories for only 24 samples.
>
> I'm including some snips of my R session below; if anyone is
interested in playing with the original data there's a compressed file
at
>
>
ftp://asiago.stanford.edu/multtest_issue.tgz
>
> that has the data matrix and the class labels.
>
> I'm running R 2.13.0 (on Mac OSX with the R.app GUI) with newly-
installed BioConductor, so everything should be up to date.
>
> Thanks for any help or insight!
> Les
>
>
> Les Dethlefsen
> Relman Lab
> Microbiology & Immunology
> Stanford University
>
> R version 2.13.0 (2011-04-13)
> Copyright (C) 2011 The R Foundation for Statistical Computing
> ISBN 3-900051-07-0
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> [R.app GUI 1.40 (5751) x86_64-apple-darwin9.8.0]
>
>> ko.data.df=as.data.frame(read.table("K57.fvn.ko.txt",header=TRUE,se
p="",row.names=1))
>> class.label.df=as.data.frame(t(read.table("ClassLabel2.txt",header=
FALSE,sep="",row.names=1)))
>> class.label.df
> ? ?Sub CpNon Cp12Non CpPIP Cp12PIP Cp12PELP SubCp12PIP
> V2 ? ?0 ? ? 0 ? ? ? 0 ? ? 0 ? ? ? 0 ? ? ? ?0 ? ? ? ? ?0
> V3 ? ?0 ? ? 0 ? ? ? 0 ? ? 0 ? ? ? 0 ? ? ? ?0 ? ? ? ? ?0
> V4 ? ?0 ? ? 1 ? ? ? 1 ? ? 1 ? ? ? 1 ? ? ? ?1 ? ? ? ? ?1
> V5 ? ?0 ? ? 1 ? ? ? 1 ? ? 1 ? ? ? 1 ? ? ? ?1 ? ? ? ? ?1
> V6 ? ?0 ? ? 0 ? ? ? 0 ? ? 2 ? ? ? 2 ? ? ? ?2 ? ? ? ? ?2
> V7 ? ?0 ? ? 0 ? ? ? 0 ? ? 2 ? ? ? 2 ? ? ? ?2 ? ? ? ? ?2
> V8 ? ?0 ? ? 0 ? ? ? 0 ? ? 2 ? ? ? 2 ? ? ? ?3 ? ? ? ? ?2
> V9 ? ?0 ? ? 0 ? ? ? 0 ? ? 2 ? ? ? 2 ? ? ? ?3 ? ? ? ? ?2
> V10 ? 0 ? ? 1 ? ? ? 2 ? ? 1 ? ? ? 3 ? ? ? ?4 ? ? ? ? ?3
> V11 ? 0 ? ? 1 ? ? ? 2 ? ? 1 ? ? ? 3 ? ? ? ?4 ? ? ? ? ?3
> V12 ? 0 ? ? 0 ? ? ? 0 ? ? 3 ? ? ? 4 ? ? ? ?5 ? ? ? ? ?4
> V13