Very low P-values in limma
6
0
Entering edit mode
Paul Geeleher ★ 1.3k
@paul-geeleher-2679
Last seen 10.4 years ago
Hi folks, I'm analyzing microRNA data using limma and I'm wondering about the validity of the p-values I'm getting out. Its a simple 'Group A Vs Group B' experimental design. 4 arrays in one group, 3 in the other and 4 duplicate spots for each miRNA on each array. The lowest adjusted p-values in the differential expression analysis are in the region of 10^-7. Its been pointed out to me that plugging the point values from each sample into a regular t-test you get p=0.008, which then also needs to take the multiple test hit. Can anybody explain why limma is giving me such lower values and if they are valid? I can provide more information if required. Thanks, Paul. -- Paul Geeleher School of Mathematics, Statistics and Applied Mathematics National University of Ireland Galway Ireland [[alternative HTML version deleted]]
miRNA limma microRNA miRNA limma microRNA • 3.0k views
ADD COMMENT
0
Entering edit mode
@vincent-j-carey-jr-4
Last seen 23 days ago
United States
This is not a very precise description of the activity. Are you sure you are looking at the right coefficient for your test? If you do topTable(fit) without specifying the coef you may be looking at p-values for the wrong test. Read the topTable documentation carefully and be sure you have set up the design matrix correctly. On Wed, Oct 21, 2009 at 1:36 PM, Paul Geeleher <paulgeeleher at="" gmail.com=""> wrote: > Hi folks, I'm analyzing microRNA data using limma and I'm wondering about > the validity of the p-values I'm getting out. Its a simple 'Group A Vs Group > B' experimental design. 4 arrays in one group, 3 in the other and 4 > duplicate spots for each miRNA on each array. > > The lowest adjusted p-values in the differential expression analysis are in > the region of 10^-7. > > Its been pointed out to me that plugging the point values from each sample > into a regular t-test you get p=0.008, which then also needs to take the > multiple test hit. Can anybody explain why limma is giving me such lower > values and if they are valid? > > I can provide more information if required. > > Thanks, > > Paul. > > -- > Paul Geeleher > School of Mathematics, Statistics and Applied Mathematics > National University of Ireland > Galway > Ireland > > ? ? ? ?[[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD COMMENT
0
Entering edit mode
@elmer-fernandez-3744
Last seen 10.4 years ago
Dear Paul You shouild take into account that you are doing very noisy statistics in first place (a common problem in highthrouput genomics), Did you apply eBayes? from limma Elmer On Wed, Oct 21, 2009 at 3:36 PM, Paul Geeleher <paulgeeleher@gmail.com>wrote: > Hi folks, I'm analyzing microRNA data using limma and I'm wondering about > the validity of the p-values I'm getting out. Its a simple 'Group A Vs > Group > B' experimental design. 4 arrays in one group, 3 in the other and 4 > duplicate spots for each miRNA on each array. > > The lowest adjusted p-values in the differential expression analysis are in > the region of 10^-7. > > Its been pointed out to me that plugging the point values from each sample > into a regular t-test you get p=0.008, which then also needs to take the > multiple test hit. Can anybody explain why limma is giving me such lower > values and if they are valid? > > I can provide more information if required. > > Thanks, > > Paul. > > -- > Paul Geeleher > School of Mathematics, Statistics and Applied Mathematics > National University of Ireland > Galway > Ireland > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Elmer A. Fernández (Bioing. PhD) Investigador Asistente CONICET - Research Assistant CONICET Porf. Inteligencia Artificial -UCC - Prof. Artificial Intelligence @ UCC tel: +54-(0)351-4938000 int 145 Fax: +54-(0)351-4938081 web page : http://www.uccor.edu.ar/modelo.php?param=3.8.5.15 mail address: Camino Alta Gracia Km 7.1/2- Córdoba-5000-Argentina [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
@thomas-hampton-2820
Last seen 10.4 years ago
I too find it very odd that a multiple hypothesis correction apparently reduces P values. One possibility is that the duplicate spot weighting in limma is somehow doing a really superior job compared to a simple t test that might view duplicates on different arrays as the same thing as duplicates on the same array. Tom On Oct 21, 2009, at 1:36 PM, Paul Geeleher wrote: > Hi folks, I'm analyzing microRNA data using limma and I'm wondering > about > the validity of the p-values I'm getting out. Its a simple 'Group A > Vs Group > B' experimental design. 4 arrays in one group, 3 in the other and 4 > duplicate spots for each miRNA on each array. > > The lowest adjusted p-values in the differential expression analysis > are in > the region of 10^-7. > > Its been pointed out to me that plugging the point values from each > sample > into a regular t-test you get p=0.008, which then also needs to take > the > multiple test hit. Can anybody explain why limma is giving me such > lower > values and if they are valid? > > I can provide more information if required. > > Thanks, > > Paul. > > -- > Paul Geeleher > School of Mathematics, Statistics and Applied Mathematics > National University of Ireland > Galway > Ireland > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
@wolfgang-huber-3550
Last seen 5 months ago
EMBL European Molecular Biology Laborat…
Dear Paul can you provide a more specific description of what you did than "analyzing ... data using limma"? The posting guide makes good suggestions. There is a possibility that this is related to the eBayes procedure (which makes an assumption on the distribution of the within-group variances across the experiment). What is the value of the element "df.prior" of the object returned by eBayes? Paul Geeleher ha scritto: > Hi folks, I'm analyzing microRNA data using limma and I'm wondering about > the validity of the p-values I'm getting out. Its a simple 'Group A Vs Group > B' experimental design. 4 arrays in one group, 3 in the other and 4 > duplicate spots for each miRNA on each array. > > The lowest adjusted p-values in the differential expression analysis are in > the region of 10^-7. > > Its been pointed out to me that plugging the point values from each sample > into a regular t-test you get p=0.008, which then also needs to take the > multiple test hit. Can anybody explain why limma is giving me such lower > values and if they are valid? > > I can provide more information if required. > > Thanks, > > Paul. > -- Best wishes Wolfgang ------------------------------------------------------- Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber
ADD COMMENT
0
Entering edit mode
Wei Shi ★ 3.6k
@wei-shi-2183
Last seen 5 weeks ago
Australia/Melbourne
Dear Paul: The low p-values you have got are not surprising to me. I have got even lower FDR adjusted p-values than yours. This just means you got significantly differentially expressed genes. But on the other hand, I did see much higher adjusted p-values in some of my microarray analyses. In that case, I will explore the data in more depth such as looking at batch effect etc. Cheers, Wei Paul Geeleher wrote: > Hi folks, I'm analyzing microRNA data using limma and I'm wondering about > the validity of the p-values I'm getting out. Its a simple 'Group A Vs Group > B' experimental design. 4 arrays in one group, 3 in the other and 4 > duplicate spots for each miRNA on each array. > > The lowest adjusted p-values in the differential expression analysis are in > the region of 10^-7. > > Its been pointed out to me that plugging the point values from each sample > into a regular t-test you get p=0.008, which then also needs to take the > multiple test hit. Can anybody explain why limma is giving me such lower > values and if they are valid? > > I can provide more information if required. > > Thanks, > > Paul. > >
ADD COMMENT
0
Entering edit mode
The value of df.prior is 3.208318. Hopefully somebody can help me here because I'm still really at a loss on why these values are so low. Here's the script. I'm fairly sure it conforms to the documentation: library(affy) setwd('/home/paul/PhD_Stuff/miRNA_guilleme_project/HERArrays/') # create a color pallete for boxplots cols <- brewer.pal(12, "Set3") # This defines the column name of the mean Cy3 foreground intensites #Cy3 <- "F532 Median" # background corrected value Cy3 <- "F532 Median - B532" # This defines the column name of the mean Cy3 background intensites Cy3b <- "B532 Mean" # Read the targets file (see limmaUsersGuide for info on how to create this) targets <- readTargets("targets.csv") # read values from gpr file # because there is no red channel, read Rb & R to be the same values as G and Gb # this is a type of hack that allows the function to work RG <- read.maimages( targets$FileName, source="genepix",columns=list(R=Cy3,G=Cy3, Rb=Cy3b, Gb=Cy3b)) # remove the extraneous values red channel values RG$R <- NULL RG$Rb <- NULL # this line of code will remove any negative values which cause errors further on RG$G[RG$G<0] <- 0 # create a pData for the data just read # this indicates which population each array belongs to # a are "her-" and b are "her+" (almost certain) pData <- data.frame(population = c('a', 'a', 'a', 'a', 'b', 'b', 'b')) rownames(pData) <- RG$targets$FileName # create design matrix design <- model.matrix(~factor(pData$population)) # In my .gpr files all miRNAs contain the string "-miR-" or "-let-" in their name # so the grep function can be used to extract all of these, removing # all control signals and printing buffers etc. # You need to check your .gpr files to find which signals you should extract. miRs <- c(grep("-miR-", RG$genes$Name), grep("-let-", RG$genes$Name)) RG.final <- RG[miRs, ] # load vsn library, this contains the normalization functions library('vsn') # Do VSN normalization and output as vns object mat <- vsnMatrix(RG.final$G) # insert rownames into matrix with normalized data # this will mean that the gene names will appear in our final output rownames(mat@hx) <- RG.final$genes$Name # my .gpr files contain 4 "within-array replicates" of each miRNA. # We need to make full use of this information by calculating the duplicate correlation # in order to use the duplicateCorrelation() function below, # we need to make sure that the replicates of each gene appear in sequence in this matrix, # so we sort the normalized values so the replicate groups of 4 miRs appear in sequence mat@hx <- mat@hx[order(rownames(mat@hx)), ] # calculate duplicate correlation between the 4 replicates on each array corfit <- duplicateCorrelation(mat, design, ndups=4) # fit the linear model, including info on duplicates and correlation fit <- lmFit(mat, design, ndups=4, correlation=corfit$consensus) # calculate values using ebayes ebayes <- eBayes(fit) # output a list of top differnetially expressed genes... topTable(ebayes, coef = 2, adjust = "BH", n = 10) -Paul. On Thu, Oct 22, 2009 at 12:09 AM, Wei Shi <shi@wehi.edu.au> wrote: > Dear Paul: > > The low p-values you have got are not surprising to me. I have got even > lower FDR adjusted p-values than yours. This just means you got > significantly differentially expressed genes. But on the other hand, I did > see much higher adjusted p-values in some of my microarray analyses. In that > case, I will explore the data in more depth such as looking at batch effect > etc. > > Cheers, > Wei > > > Paul Geeleher wrote: > >> Hi folks, I'm analyzing microRNA data using limma and I'm wondering about >> the validity of the p-values I'm getting out. Its a simple 'Group A Vs >> Group >> B' experimental design. 4 arrays in one group, 3 in the other and 4 >> duplicate spots for each miRNA on each array. >> >> The lowest adjusted p-values in the differential expression analysis are >> in >> the region of 10^-7. >> >> Its been pointed out to me that plugging the point values from each sample >> into a regular t-test you get p=0.008, which then also needs to take the >> multiple test hit. Can anybody explain why limma is giving me such lower >> values and if they are valid? >> >> I can provide more information if required. >> >> Thanks, >> >> Paul. >> >> >> > -- Paul Geeleher School of Mathematics, Statistics and Applied Mathematics National University of Ireland Galway Ireland [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Dear Paul please do not do something like: # this line of code will remove any negative values which cause errors # further on RG$G[RG$G<0] <- 0 First, vsn is intended to work with data that is not mutilated like that. There is ample literature on this topic. The vsn vignette is a good start. Second, when you use a parametric test (which is what lmFit/eBayes/topTable do), do not arbitrarily replace parts of your data with phantasy values, it can invalidate the assumptions underlying the p-value computation. I am not sure whether this already fixes all your problems, but this is certainly one of them. Best wishes Wolfgang Paul Geeleher ha scritto: > The value of df.prior is 3.208318. Hopefully somebody can help me here > because I'm still really at a loss on why these values are so low. > > Here's the script. I'm fairly sure it conforms to the documentation: > > library(affy) > > setwd('/home/paul/PhD_Stuff/miRNA_guilleme_project/HERArrays/') > > # create a color pallete for boxplots > cols <- brewer.pal(12, "Set3") > > # This defines the column name of the mean Cy3 foreground intensites > #Cy3 <- "F532 Median" > # background corrected value > Cy3 <- "F532 Median - B532" > > # This defines the column name of the mean Cy3 background intensites > Cy3b <- "B532 Mean" > > # Read the targets file (see limmaUsersGuide for info on how to create this) > targets <- readTargets("targets.csv") > > # read values from gpr file > # because there is no red channel, read Rb & R to be the same values as G > and Gb > # this is a type of hack that allows the function to work > RG <- read.maimages( targets$FileName, > source="genepix",columns=list(R=Cy3,G=Cy3, Rb=Cy3b, Gb=Cy3b)) > > # remove the extraneous values red channel values > RG$R <- NULL > RG$Rb <- NULL > > # this line of code will remove any negative values which cause errors > further on > RG$G[RG$G<0] <- 0 > > # create a pData for the data just read > # this indicates which population each array belongs to > # a are "her-" and b are "her+" (almost certain) > pData <- data.frame(population = c('a', 'a', 'a', 'a', 'b', 'b', 'b')) > rownames(pData) <- RG$targets$FileName > > # create design matrix > design <- model.matrix(~factor(pData$population)) > > # In my .gpr files all miRNAs contain the string "-miR-" or "-let-" in their > name > # so the grep function can be used to extract all of these, removing > # all control signals and printing buffers etc. > # You need to check your .gpr files to find which signals you should > extract. > miRs <- c(grep("-miR-", RG$genes$Name), grep("-let-", RG$genes$Name)) > RG.final <- RG[miRs, ] > > # load vsn library, this contains the normalization functions > library('vsn') > > # Do VSN normalization and output as vns object > mat <- vsnMatrix(RG.final$G) > > # insert rownames into matrix with normalized data > # this will mean that the gene names will appear in our final output > rownames(mat at hx) <- RG.final$genes$Name > > # my .gpr files contain 4 "within-array replicates" of each miRNA. > # We need to make full use of this information by calculating the duplicate > correlation > # in order to use the duplicateCorrelation() function below, > # we need to make sure that the replicates of each gene appear in sequence > in this matrix, > # so we sort the normalized values so the replicate groups of 4 miRs appear > in sequence > mat at hx <- mat at hx[order(rownames(mat at hx)), ] > > # calculate duplicate correlation between the 4 replicates on each array > corfit <- duplicateCorrelation(mat, design, ndups=4) > > # fit the linear model, including info on duplicates and correlation > fit <- lmFit(mat, design, ndups=4, correlation=corfit$consensus) > > # calculate values using ebayes > ebayes <- eBayes(fit) > > # output a list of top differnetially expressed genes... > topTable(ebayes, coef = 2, adjust = "BH", n = 10) > > > -Paul. > > > On Thu, Oct 22, 2009 at 12:09 AM, Wei Shi <shi at="" wehi.edu.au=""> wrote: > >> Dear Paul: >> >> The low p-values you have got are not surprising to me. I have got even >> lower FDR adjusted p-values than yours. This just means you got >> significantly differentially expressed genes. But on the other hand, I did >> see much higher adjusted p-values in some of my microarray analyses. In that >> case, I will explore the data in more depth such as looking at batch effect >> etc. >> >> Cheers, >> Wei >> >> >> Paul Geeleher wrote: >> >>> Hi folks, I'm analyzing microRNA data using limma and I'm wondering about >>> the validity of the p-values I'm getting out. Its a simple 'Group A Vs >>> Group >>> B' experimental design. 4 arrays in one group, 3 in the other and 4 >>> duplicate spots for each miRNA on each array. >>> >>> The lowest adjusted p-values in the differential expression analysis are >>> in >>> the region of 10^-7. >>> >>> Its been pointed out to me that plugging the point values from each sample >>> into a regular t-test you get p=0.008, which then also needs to take the >>> multiple test hit. Can anybody explain why limma is giving me such lower >>> values and if they are valid? >>> >>> I can provide more information if required. >>> >>> Thanks, >>> >>> Paul. >>> >>> >>> > > -- Best wishes Wolfgang ------------------------------------------------------- Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber
ADD REPLY
0
Entering edit mode
Hi Wolfgang, Thanks alot for your reply, I removed the offending line and the p-values are still much the same. Actually the reason I had that line in was that I generated a plot (PCA I think) that wouldn't accept negative values. -Paul. On Thu, Oct 22, 2009 at 1:21 PM, Wolfgang Huber <whuber@embl.de> wrote: > Dear Paul > > please do not do something like: > > # this line of code will remove any negative values which cause errors > # further on > RG$G[RG$G<0] <- 0 > > First, vsn is intended to work with data that is not mutilated like that. > There is ample literature on this topic. The vsn vignette is a good start. > > Second, when you use a parametric test (which is what lmFit/eBayes/topTable > do), do not arbitrarily replace parts of your data with phantasy values, it > can invalidate the assumptions underlying the p-value computation. > > I am not sure whether this already fixes all your problems, but this is > certainly one of them. > > Best wishes > Wolfgang > > > > Paul Geeleher ha scritto: > > The value of df.prior is 3.208318. Hopefully somebody can help me here >> because I'm still really at a loss on why these values are so low. >> >> Here's the script. I'm fairly sure it conforms to the documentation: >> >> library(affy) >> >> setwd('/home/paul/PhD_Stuff/miRNA_guilleme_project/HERArrays/') >> >> # create a color pallete for boxplots >> cols <- brewer.pal(12, "Set3") >> >> # This defines the column name of the mean Cy3 foreground intensites >> #Cy3 <- "F532 Median" >> # background corrected value >> Cy3 <- "F532 Median - B532" >> >> # This defines the column name of the mean Cy3 background intensites >> Cy3b <- "B532 Mean" >> >> # Read the targets file (see limmaUsersGuide for info on how to create >> this) >> targets <- readTargets("targets.csv") >> >> # read values from gpr file >> # because there is no red channel, read Rb & R to be the same values as G >> and Gb >> # this is a type of hack that allows the function to work >> RG <- read.maimages( targets$FileName, >> source="genepix",columns=list(R=Cy3,G=Cy3, Rb=Cy3b, Gb=Cy3b)) >> >> # remove the extraneous values red channel values >> RG$R <- NULL >> RG$Rb <- NULL >> >> # this line of code will remove any negative values which cause errors >> further on >> RG$G[RG$G<0] <- 0 >> >> # create a pData for the data just read >> # this indicates which population each array belongs to >> # a are "her-" and b are "her+" (almost certain) >> pData <- data.frame(population = c('a', 'a', 'a', 'a', 'b', 'b', 'b')) >> rownames(pData) <- RG$targets$FileName >> >> # create design matrix >> design <- model.matrix(~factor(pData$population)) >> >> # In my .gpr files all miRNAs contain the string "-miR-" or "-let-" in >> their >> name >> # so the grep function can be used to extract all of these, removing >> # all control signals and printing buffers etc. >> # You need to check your .gpr files to find which signals you should >> extract. >> miRs <- c(grep("-miR-", RG$genes$Name), grep("-let-", RG$genes$Name)) >> RG.final <- RG[miRs, ] >> >> # load vsn library, this contains the normalization functions >> library('vsn') >> >> # Do VSN normalization and output as vns object >> mat <- vsnMatrix(RG.final$G) >> >> # insert rownames into matrix with normalized data >> # this will mean that the gene names will appear in our final output >> rownames(mat@hx) <- RG.final$genes$Name >> >> # my .gpr files contain 4 "within-array replicates" of each miRNA. >> # We need to make full use of this information by calculating the >> duplicate >> correlation >> # in order to use the duplicateCorrelation() function below, >> # we need to make sure that the replicates of each gene appear in sequence >> in this matrix, >> # so we sort the normalized values so the replicate groups of 4 miRs >> appear >> in sequence >> mat@hx <- mat@hx[order(rownames(mat@hx)), ] >> >> # calculate duplicate correlation between the 4 replicates on each array >> corfit <- duplicateCorrelation(mat, design, ndups=4) >> >> # fit the linear model, including info on duplicates and correlation >> fit <- lmFit(mat, design, ndups=4, correlation=corfit$consensus) >> >> # calculate values using ebayes >> ebayes <- eBayes(fit) >> >> # output a list of top differnetially expressed genes... >> topTable(ebayes, coef = 2, adjust = "BH", n = 10) >> >> >> -Paul. >> >> >> On Thu, Oct 22, 2009 at 12:09 AM, Wei Shi <shi@wehi.edu.au> wrote: >> >> Dear Paul: >>> >>> The low p-values you have got are not surprising to me. I have got even >>> lower FDR adjusted p-values than yours. This just means you got >>> significantly differentially expressed genes. But on the other hand, I >>> did >>> see much higher adjusted p-values in some of my microarray analyses. In >>> that >>> case, I will explore the data in more depth such as looking at batch >>> effect >>> etc. >>> >>> Cheers, >>> Wei >>> >>> >>> Paul Geeleher wrote: >>> >>> Hi folks, I'm analyzing microRNA data using limma and I'm wondering >>>> about >>>> the validity of the p-values I'm getting out. Its a simple 'Group A Vs >>>> Group >>>> B' experimental design. 4 arrays in one group, 3 in the other and 4 >>>> duplicate spots for each miRNA on each array. >>>> >>>> The lowest adjusted p-values in the differential expression analysis are >>>> in >>>> the region of 10^-7. >>>> >>>> Its been pointed out to me that plugging the point values from each >>>> sample >>>> into a regular t-test you get p=0.008, which then also needs to take the >>>> multiple test hit. Can anybody explain why limma is giving me such lower >>>> values and if they are valid? >>>> >>>> I can provide more information if required. >>>> >>>> Thanks, >>>> >>>> Paul. >>>> >>>> >>>> >>>> >> >> > -- > > Best wishes > Wolfgang > > ------------------------------------------------------- > Wolfgang Huber > EMBL > http://www.embl.de/research/units/genome_biology/huber > ------------------------------------------------------- > -- Paul Geeleher School of Mathematics, Statistics and Applied Mathematics National University of Ireland Galway Ireland [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Hi Paul how does hist(topTable(ebayes, coef=2, adjust="BH", n=nrow(ebayes))$P.Value, breaks=100) look like? Best wishes Wolfgang Paul Geeleher ha scritto: > Hi Wolfgang, > > Thanks alot for your reply, I removed the offending line and the > p-values are still much the same. Actually the reason I had that line in > was that I generated a plot (PCA I think) that wouldn't accept negative > values. > > -Paul. > > > > On Thu, Oct 22, 2009 at 1:21 PM, Wolfgang Huber <whuber at="" embl.de=""> <mailto:whuber at="" embl.de="">> wrote: > > Dear Paul > > please do not do something like: > > > # this line of code will remove any negative values which cause errors > # further on > RG$G[RG$G<0] <- 0 > > First, vsn is intended to work with data that is not mutilated like > that. There is ample literature on this topic. The vsn vignette is a > good start. > > Second, when you use a parametric test (which is what > lmFit/eBayes/topTable do), do not arbitrarily replace parts of your > data with phantasy values, it can invalidate the assumptions > underlying the p-value computation. > > I am not sure whether this already fixes all your problems, but this > is certainly one of them. > > Best wishes > Wolfgang > > > > Paul Geeleher ha scritto: > > The value of df.prior is 3.208318. Hopefully somebody can help > me here > because I'm still really at a loss on why these values are so low. > > Here's the script. I'm fairly sure it conforms to the documentation: > > library(affy) > > setwd('/home/paul/PhD_Stuff/miRNA_guilleme_project/HERArrays/') > > # create a color pallete for boxplots > cols <- brewer.pal(12, "Set3") > > # This defines the column name of the mean Cy3 foreground intensites > #Cy3 <- "F532 Median" > # background corrected value > Cy3 <- "F532 Median - B532" > > # This defines the column name of the mean Cy3 background intensites > Cy3b <- "B532 Mean" > > # Read the targets file (see limmaUsersGuide for info on how to > create this) > targets <- readTargets("targets.csv") > > # read values from gpr file > # because there is no red channel, read Rb & R to be the same > values as G > and Gb > # this is a type of hack that allows the function to work > RG <- read.maimages( targets$FileName, > source="genepix",columns=list(R=Cy3,G=Cy3, Rb=Cy3b, Gb=Cy3b)) > > # remove the extraneous values red channel values > RG$R <- NULL > RG$Rb <- NULL > > # this line of code will remove any negative values which cause > errors > further on > RG$G[RG$G<0] <- 0 > > # create a pData for the data just read > # this indicates which population each array belongs to > # a are "her-" and b are "her+" (almost certain) > pData <- data.frame(population = c('a', 'a', 'a', 'a', 'b', 'b', > 'b')) > rownames(pData) <- RG$targets$FileName > > # create design matrix > design <- model.matrix(~factor(pData$population)) > > # In my .gpr files all miRNAs contain the string "-miR-" or > "-let-" in their > name > # so the grep function can be used to extract all of these, removing > # all control signals and printing buffers etc. > # You need to check your .gpr files to find which signals you should > extract. > miRs <- c(grep("-miR-", RG$genes$Name), grep("-let-", > RG$genes$Name)) > RG.final <- RG[miRs, ] > > # load vsn library, this contains the normalization functions > library('vsn') > > # Do VSN normalization and output as vns object > mat <- vsnMatrix(RG.final$G) > > # insert rownames into matrix with normalized data > # this will mean that the gene names will appear in our final output > rownames(mat at hx) <- RG.final$genes$Name > > # my .gpr files contain 4 "within-array replicates" of each miRNA. > # We need to make full use of this information by calculating > the duplicate > correlation > # in order to use the duplicateCorrelation() function below, > # we need to make sure that the replicates of each gene appear > in sequence > in this matrix, > # so we sort the normalized values so the replicate groups of 4 > miRs appear > in sequence > mat at hx <- mat at hx[order(rownames(mat at hx)), ] > > # calculate duplicate correlation between the 4 replicates on > each array > corfit <- duplicateCorrelation(mat, design, ndups=4) > > # fit the linear model, including info on duplicates and correlation > fit <- lmFit(mat, design, ndups=4, correlation=corfit$consensus) > > # calculate values using ebayes > ebayes <- eBayes(fit) > > # output a list of top differnetially expressed genes... > topTable(ebayes, coef = 2, adjust = "BH", n = 10) > > > -Paul. > > > On Thu, Oct 22, 2009 at 12:09 AM, Wei Shi <shi at="" wehi.edu.au=""> <mailto:shi at="" wehi.edu.au="">> wrote: > > Dear Paul: > > The low p-values you have got are not surprising to me. I > have got even > lower FDR adjusted p-values than yours. This just means you got > significantly differentially expressed genes. But on the > other hand, I did > see much higher adjusted p-values in some of my microarray > analyses. In that > case, I will explore the data in more depth such as looking > at batch effect > etc. > > Cheers, > Wei > > > Paul Geeleher wrote: > > Hi folks, I'm analyzing microRNA data using limma and > I'm wondering about > the validity of the p-values I'm getting out. Its a > simple 'Group A Vs > Group > B' experimental design. 4 arrays in one group, 3 in the > other and 4 > duplicate spots for each miRNA on each array. > > The lowest adjusted p-values in the differential > expression analysis are > in > the region of 10^-7. > > Its been pointed out to me that plugging the point > values from each sample > into a regular t-test you get p=0.008, which then also > needs to take the > multiple test hit. Can anybody explain why limma is > giving me such lower > values and if they are valid? > > I can provide more information if required. > > Thanks, > > Paul. > > > > > > > -- > > Best wishes > Wolfgang > > ------------------------------------------------------- > Wolfgang Huber > EMBL > http://www.embl.de/research/units/genome_biology/huber > ------------------------------------------------------- > > > > > -- > Paul Geeleher > School of Mathematics, Statistics and Applied Mathematics > National University of Ireland > Galway > Ireland > -- Best wishes Wolfgang ------------------------------------------------------- Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber
ADD REPLY
0
Entering edit mode
Olga Vitek ▴ 40
@olga-vitek-3749
Last seen 9.0 years ago
Dear Paul, to see what's going on, it may be helpful to step away from Limma for a moment, and think of a case-control experiment with one array per subject in the context of standard mixed-effects ANOVA. The model for a gene will look like that: log(intensity) = GrandMean + Group{Fixed} + Array(Group){Random, confounded with Subject} + replicate(Array){Random} With this model, you test the difference between the groups against the variation between arrays. In the special case of balanced design (i.e. an equal number of arrays per group, and an equal number of replicates per array), this test only involves averages of replicates in each array. In other words, using this model will be equivalent to using the model log(averageIntensityOverReplicates) = GrandMean + Group{Fixed} + Array(Group){Random, confounded with Subject} after averaging the spots on an array. However the models are not equivalent if your design is unbalanced or if you have missing values. The difficulty with the ANOVA is that, when you have a small number of arrays (=subjects) per group, the variance of Array(Group) is not estimated reliably in a single gene. In Limma, Gordon took a different approach, and estimated the variance of Array(Group) prior to fitting the model, using the information across all genes. The advantage is that the estimator uses more data, and is therefore more reliable. The disadvantages are that (1) it makes the assumption of a same variance of Array(Group) in all genes, and (2) the subsequent model fit assumes that this variance is known, and not estimated from the same data. Therefore, I second Gordon's opinion: if you have enough arrays per group and a balanced design, than averaging is appropriate. If you have few arras but many technical replicates, estimating this variance using the duplicateCorrelation() may work best. And one more comment: although it is true that in many cases we want to make inference about populations (e.g. patients) from a random sample of individuals, this is not always the case. Experiments with a small number of subjects may not contain enough information about the underlying population. In this case one may want to restrict the inference to gene expressions in these specific subjects, and only account for technical variation in the microarray measurements. This translates into the ANOVA model log(intensity) = GrandMean + Group{Fixed} + Array(Group){Fixed, confounded with Subject} + replicate(Array){Random} In Limma, this will amount to adding "Array" as an extra model term, and skipping duplicateCorrelation(). Best regards Olga Vitek Assistant Professor Statistics and Computer Science Purdue University > > > ------------------------------ > > Message: 18 > Date: Thu, 29 Oct 2009 17:50:30 +1100 (AUS Eastern Daylight Time) > From: Gordon K Smyth <smyth at="" wehi.edu.au=""> > Subject: Re: [BioC] Very low P-values in limma > To: Paul Geeleher <paulgeeleher at="" gmail.com=""> > Cc: "Mayer, Claus-Dieter" <c.mayer at="" abdn.ac.uk="">, Bioconductor mailing > list <bioconductor at="" stat.math.ethz.ch=""> > Message-ID: <pine.wnt.4.64.0910291625370.4880 at="" pc602.alpha.wehi.edu.au=""> > Content-Type: text/plain; charset="x-unknown"; Format="flowed" > > Dear Paul, > > Is it possible that you haven't quoted your professor verbatim?, > because > these comments don't make sense as they stand. I really know what he > might mean by real p-values or the assumption of zero measurement > error. > Measurement error obviously can't be zero. Nor can there be an > infinite > number of replicates. None of the alternative analysis methods we > have > discussed makes either of these assumptions. I wasn't the one > arguing for > averaging within-array replicates, but if that method did assume > what you > say, then it would have to be an invalid method. > > On the other hand, your professor is quite right to say that within- > array > replicates measure technical rather than biological variability. In a > univariate analysis, one would simply average the technical > replicates. > This would give a summary reponse variable, with a variance made up of > both biological and technical components, with replicates that you > could > reasonably treat as independent. > > In a genewise microarray analysis, averaging the within-replicates > has a > disadvantage in that it fails to penalize (lower the rank of) genes > which > have high within-array variability. If biological variability is high > compared to technical, and you have a enough array replicates to get a > decent estimate of between-array variability, then averaging the > within-array replicates is likely still the way to go, just as in a > univariate analysis. On the other hand, if technical variability > (within > and between arrays) is relatively large compared to biological, and > the > number of array replicates is very small, then the information in the > within-array variances can be too valuable to ignore. > duplicateCorrelation uses the fact that the between-array variance > has a > technical as well as a biological component, and the between and > within > technical components tend to be associated across probes for many > microarray platforms. It is this last assumption which allows us to > make > use of within-array standard deviations when making inferences about > between sample comparisons. > > If your priority is to get reliable p-values, and you think you have > enough array replication to do this, then average the within-array > replicates. If your array replication is limited, technical > variability > is high, and your priority is to rank the genes, then > duplicateCorrelation > may help. I would add that microarray p-values should always be taken > with a grain of salt, as it's impossible to verify all assumptions in > small experiments, and it's useful instead to think in terms of > independent verification of the results. > > This is really as far as I want to debate it. Obviously it's your > analysis and you should use your own judgement. As a maths graduate > student, you would be able to read the duplicateCorrelation published > paper if you want to check the reasoning in detail. > > Best wishes > Gordon > > > On Wed, 28 Oct 2009, Paul Geeleher wrote: > >> Dear list, >> >> The following are the words of a professor in my department: >> >> I still don't get why the 'real' p-values could be better than >> p-values you get with the assumption of zero measurement error. By >> averaging over within array replicates you are not ignoring the >> within >> array replicates, instead you are acting as though there were >> infinitely many of them, so that the standard error of the expression >> level within array is zero. Stats is about making inferences about >> populations from finite samples. The population you are making >> inferences about is the population of all late-stage breast cancers. >> The data are from 7 individuals. The within-array replicates give an >> indication of measurement error of the expression levels but don't >> give you a handle on the variability of the quantity of interest in >> the population. >> >> Paul >> >> On Sat, Oct 24, 2009 at 2:44 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> >> wrote: >>> >>> >>> On Sat, 24 Oct 2009, Gordon K Smyth wrote: >>> >>>> Dear Paul, >>>> >>>> Give your consensus correlation value, limma is treating your >>>> within-array >>>> replicates as worth about 1/3 as much as replicates on >>>> independent arrays >>>> (because 1-0.81^2 is about 1/3). >>> >>> Sorry, my maths is wrong. ?The effective weight of the within- array >>> replicates is quite a bit less than 1/3, given ndups=4 and cor=0.81. >>> >>> Best wishes >>> Gordon >>> >> >> >> >> -- >> Paul Geeleher >> School of Mathematics, Statistics and Applied Mathematics >> National University of Ireland >> Galway >> Ireland >>
ADD COMMENT

Login before adding your answer.

Traffic: 740 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6