Statistical comparison of low replicate affy data

0

Entering edit mode

Matthew Hannah ▴ 940

@matthew-hannah-621

Last seen 10.2 years ago

Hi, I'm looking at how different analyses of affy data perform on a sample data set of 2 conditions (untr vs. trt) with 3 biological reps (Ua, Ub, Uc vs. Ta, Tb, Tc). I've computed RMA and GCRMA expression measures as standard and so have 2 exprSets containing these values. I've looked at fold changes and the treatment leads to many (1000-RMA, 1600 - GCRMA) pairwise 2x changes (Ua-Ta, Ub-Tb, Uc-Tc all >2). In order to estimate the false positive rate I made pairwise comparisons within groups (Ua- Ub, Ua-Ub, Ub-Uc and same for T) and was suprised to see that with only 3 reps there were very few genes that met the 2x criteria by chance (<5 - RMA, <10 GCRMA). What are peoples views on estimating false positives in this way? I now want to make some statistical comparisons of the data both paired and un- paired. I was thinking of making these comparisons first Ua-Ta.. (paired) and then Uabc-Tabc (unpaired) and then permutate the data so to compare Ua,Tb,Uc - Ta,Ub,Tc....etc in various combinations paired and un-paired. Would this provide reliable false positive rates? I have looked into the BioC packages and guess I'll use a t-test with multest correction, LPE (although this doesn't say it accepts RMA-type data?) and SAM, EBAM & EBAM.WILC from siggenes. Are there others I should also consider? My request for help is if people have experience of applying these tests to affy data, specifically in the form of the RMA style exprSets my data is currently in, could they possibly post or send the r-scripts they used. I've obviously searched BioC and help but my attempts so far have returned errors and I can't help but think I'm missing something obvious (need to get rid of the affy ID's?) and obviously help would speed things up a great deal. For example > cl <- c(0,0,0,1,1,1) > rmasam <- sam(esetrma, cl) returned - Error in var(v) : missing observations in cov/cor > rmaebw <- ebam.wilc(esetrma, cl) returned - Error in 2^data : non-numeric argument to binary operator Obviously if anyone is interested in what results I (eventually) obtain then let me know. Thanks, Matt

affy gcrma siggenes LPE affy gcrma siggenes LPE • 1.9k views

ADD COMMENT • link 20.8 years ago Matthew Hannah ▴ 940

0

Entering edit mode

Rafael A. Irizarry ★ 2.3k

@rafael-a-irizarry-205

Last seen 10.2 years ago

On Wed, 18 Feb 2004, Matthew Hannah wrote: > Hi, > > I'm looking at how different analyses of affy data perform on a sample data > set of 2 conditions (untr vs. trt) with 3 biological reps (Ua, Ub, Uc vs. > Ta, Tb, Tc). I've computed RMA and GCRMA expression measures as standard and > so have 2 exprSets containing these values. > > I've looked at fold changes and the treatment leads to many (1000-RMA, 1600 - > GCRMA) pairwise 2x changes (Ua-Ta, Ub-Tb, Uc-Tc all >2). In order to estimate > the false positive rate I made pairwise comparisons within groups (Ua-Ub, Ua-Ub, > Ub-Uc and same for T) and was suprised to see that with only 3 reps there were > very few genes that met the 2x criteria by chance (<5 - RMA, <10 GCRMA). What > are peoples views on estimating false positives in this way? this is a complicated issue. one simple point is that if they are techinical reps you are understimating FP rates what you would get from biological reps. > > I now want to make some statistical comparisons of the data both paired and un- > paired. I was thinking of making these comparisons first Ua-Ta.. (paired) and > then Uabc-Tabc (unpaired) and then permutate the data so to compare Ua,Tb,Uc - > Ta,Ub,Tc....etc in various combinations paired and un-paired. Would this provide > reliable false positive rates? > > I have looked into the BioC packages and guess I'll use a t-test with multest > correction, LPE (although this doesn't say it accepts RMA-type data?) and SAM, > EBAM & EBAM.WILC from siggenes. Are there others I should also consider? > > My request for help is if people have experience of applying these tests to > affy data, specifically in the form of the RMA style exprSets my data is > currently in, could they possibly post or send the r-scripts they used. I've > obviously searched BioC and help but my attempts so far have returned errors > and I can't help but think I'm missing something obvious (need to get rid of > the affy ID's?) and obviously help would speed things up a great deal. > > For example > > cl <- c(0,0,0,1,1,1) > > rmasam <- sam(esetrma, cl) > returned - Error in var(v) : missing observations in cov/cor > > > rmaebw <- ebam.wilc(esetrma, cl) > returned - Error in 2^data : non-numeric argument to binary operator > > Obviously if anyone is interested in what results I (eventually) obtain then > let me know. > > Thanks, > > Matt > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor >

ADD COMMENT • link 20.8 years ago Rafael A. Irizarry ★ 2.3k

0

Entering edit mode

Matthew Hannah ▴ 940

@matthew-hannah-621

Last seen 10.2 years ago

I realise that with tech reps you would underestimate FP's but as I said in this case they are biological reps. The data is from 3 growthroom exp over a 6 month period. That was why I was suprised at how few FP's there were. In fact for consistent 1.5 fold changes the numbers are equally impressive with only around 15 FP's for RMA and 50 for GCRMA. What do people think about this?

ADD COMMENT • link 20.8 years ago Matthew Hannah ▴ 940

0

Entering edit mode

Hello All, I am quite new to BioC and would appreciate your help on this. I am experimenting with AffyPLM and taking a look the post AffyPLM quality diagnostic pseudo-chip images [pset <- fitPLM(eset) and then image(pset) ]. Can anybody recommend how one shoud interpret these images as a quality-checking step? Further, could one clarify for me what the differences between the plots that show weight, and those that show residuals? I thank you for your kind attention, Lawrence ___________________________________ Lawrence-Paul Petalidis University of Cambridge Department of Pathology Division of Molecular Histopathology Addenbrookes Hospital, Level 3, Lab Block Hills Road, CB2 2QQ Cambridge Tel. : ++44 1223 762084 Fax : ++44 1223 586670

ADD REPLY • link 20.8 years ago Lawrence Paul Petalidis ▴ 130

0

Entering edit mode

Paul, The weights are derived from the model fit residuals - they are transformations of the residuals standardized by the model fit residual variance (fit here being probe set specific). Weights will be 1.0 if the residuals are small compared to the residual variance, and then decrease toward zero as the value of the absolute standardized residuals increase. On the chip speudo-image of the weights, what is highlighted is the spatial distribution of probes with large absolute standardized residuals - outliers, in a sense, that might have an impact on the fit, although this impact is minimized by the robustness of the fit. If there is a local artifact - a scratch, uneven hybridization, incomplete wash, bubble, etc - this will appear as a cluster on the chip weight pseudo-image. You could also see a chip where residuals are uniformly elevated throughout the chip indicating that either the RNA preparation, or the hybridization assay failed. The pseudo images of the residuals are just images of untransformed residuals. Here you may see local clusters that do not appear in the weights - corresponding to slightly dim or slightly bright spots which lead to elevated residuals, but not elevated enough to be picked up by the weights. These are useful to detect effects which may be good to know but are too subtle to be picked up in the weights. In general, the images of the residuals tell the same story as the weights. -francois --- Lawrence Paul Petalidis <lpp22@cam.ac.uk> wrote: > Hello All, > I am quite new to BioC and would appreciate your > help on this. I am > experimenting with AffyPLM and taking a look the > post AffyPLM quality > diagnostic pseudo-chip images [pset <- fitPLM(eset) > and then > image(pset) ]. Can anybody recommend how one > shoud interpret these > images as a quality-checking step? Further, could > one clarify for me > what the differences between the plots that show > weight, and those that > show residuals? > I thank you for your kind attention, Lawrence > > > > ___________________________________ > Lawrence-Paul Petalidis > University of Cambridge > Department of Pathology > Division of Molecular Histopathology > Addenbrookes Hospital, Level 3, Lab Block > Hills Road, CB2 2QQ > Cambridge > > Tel. : ++44 1223 762084 > Fax : ++44 1223 586670 > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor

ADD REPLY • link 20.8 years ago Francois Collin ▴ 130

0

Entering edit mode

Because the methods for combining the probes into gene expression values are robust to outliers, and because the probes are printed so that probes from the same genes are spatially dispersed, scratches and small "blobs" should not have much effect on your results. Defects that cover a large percentage of your array will certainly be poor. I have looked at about 50 array images (mostly arabidopsis and mouse) and all have horizontal striping that appears to be a scanner artifact. --Naomi At 09:31 AM 2/19/2004, Francois Collin wrote: >Paul, >The weights are derived from the model fit residuals - >they are transformations of the residuals standardized >by the model fit residual variance (fit here being >probe set specific). Weights will be 1.0 if the >residuals are small compared to the residual variance, >and then decrease toward zero as the value of the >absolute standardized residuals increase. On the chip >speudo-image of the weights, what is highlighted is >the spatial distribution of probes with large absolute >standardized residuals - outliers, in a sense, that >might have an impact on the fit, although this impact >is minimized by the robustness of the fit. If there >is a local artifact - a scratch, uneven hybridization, >incomplete wash, bubble, etc - this will appear as a >cluster on the chip weight pseudo-image. You could >also see a chip where residuals are uniformly elevated >throughout the chip indicating that either the RNA >preparation, or the hybridization assay failed. > >The pseudo images of the residuals are just images of >untransformed residuals. Here you may see local >clusters that do not appear in the weights - >corresponding to slightly dim or slightly bright spots >which lead to elevated residuals, but not elevated >enough to be picked up by the weights. These are >useful to detect effects which may be good to know but >are too subtle to be picked up in the weights. In >general, the images of the residuals tell the same >story as the weights. > >-francois > > >--- Lawrence Paul Petalidis <lpp22@cam.ac.uk> wrote: > > Hello All, > > I am quite new to BioC and would appreciate your > > help on this. I am > > experimenting with AffyPLM and taking a look the > > post AffyPLM quality > > diagnostic pseudo-chip images [pset <- fitPLM(eset) > > and then > > image(pset) ]. Can anybody recommend how one > > shoud interpret these > > images as a quality-checking step? Further, could > > one clarify for me > > what the differences between the plots that show > > weight, and those that > > show residuals? > > I thank you for your kind attention, Lawrence > > > > > > > > ___________________________________ > > Lawrence-Paul Petalidis > > University of Cambridge > > Department of Pathology > > Division of Molecular Histopathology > > Addenbrookes Hospital, Level 3, Lab Block > > Hills Road, CB2 2QQ > > Cambridge > > > > Tel. : ++44 1223 762084 > > Fax : ++44 1223 586670 > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@stat.math.ethz.ch > > >https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor > >_______________________________________________ >Bioconductor mailing list >Bioconductor@stat.math.ethz.ch >https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor Naomi S. Altman 814-865-3791 (voice) Associate Professor Bioinformatics Consulting Center Dept. of Statistics 814-863-7114 (fax) Penn State University 814-865-1348 (Statistics) University Park, PA 16802-2111

ADD REPLY • link 20.8 years ago Naomi Altman ★ 6.0k

0

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 45 minutes ago

United States

For siggenes you can't pass an exprSet. You have to pass a matrix of expression values, e.g., exprs(esetrma). Since the affy IDs will be the row names of the matrix, you don't have to get rid of them. Also note that any permutation testing you do with these data is going to be a bit sketchy because the null distribution is going to be extremely coarse. For instance, there are only 20 combinations of your data that you can use to estimate the null in the unpaired case. A general (minimum) recommendation for estimating the null is to use 500 - 1000 permutations which you obviously don't have. With these data you might be better off using limma for the analysis and use a non-permuted fdr to estimate false positives. There is an example in the online manual for limma that does almost exactly what you are trying to accomplish. Best, Jim James W. MacDonald Affymetrix and cDNA Microarray Core University of Michigan Cancer Center 1500 E. Medical Center Drive 7410 CCGC Ann Arbor MI 48109 734-647-5623 >>> "Matthew Hannah" <hannah@mpimp-golm.mpg.de> 02/18/04 08:31AM >>> Hi, I'm looking at how different analyses of affy data perform on a sample data set of 2 conditions (untr vs. trt) with 3 biological reps (Ua, Ub, Uc vs. Ta, Tb, Tc). I've computed RMA and GCRMA expression measures as standard and so have 2 exprSets containing these values. I've looked at fold changes and the treatment leads to many (1000-RMA, 1600 - GCRMA) pairwise 2x changes (Ua-Ta, Ub-Tb, Uc-Tc all >2). In order to estimate the false positive rate I made pairwise comparisons within groups (Ua-Ub, Ua-Ub, Ub-Uc and same for T) and was suprised to see that with only 3 reps there were very few genes that met the 2x criteria by chance (<5 - RMA, <10 GCRMA). What are peoples views on estimating false positives in this way? I now want to make some statistical comparisons of the data both paired and un- paired. I was thinking of making these comparisons first Ua-Ta.. (paired) and then Uabc-Tabc (unpaired) and then permutate the data so to compare Ua,Tb,Uc - Ta,Ub,Tc....etc in various combinations paired and un-paired. Would this provide reliable false positive rates? I have looked into the BioC packages and guess I'll use a t-test with multest correction, LPE (although this doesn't say it accepts RMA-type data?) and SAM, EBAM & EBAM.WILC from siggenes. Are there others I should also consider? My request for help is if people have experience of applying these tests to affy data, specifically in the form of the RMA style exprSets my data is currently in, could they possibly post or send the r-scripts they used. I've obviously searched BioC and help but my attempts so far have returned errors and I can't help but think I'm missing something obvious (need to get rid of the affy ID's?) and obviously help would speed things up a great deal. For example > cl <- c(0,0,0,1,1,1) > rmasam <- sam(esetrma, cl) returned - Error in var(v) : missing observations in cov/cor > rmaebw <- ebam.wilc(esetrma, cl) returned - Error in 2^data : non-numeric argument to binary operator Obviously if anyone is interested in what results I (eventually) obtain then let me know. Thanks, Matt _______________________________________________ Bioconductor mailing list Bioconductor@stat.math.ethz.ch https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor

ADD COMMENT • link 20.8 years ago James W. MacDonald 67k

0

Entering edit mode

Matthew Hannah ▴ 940

@matthew-hannah-621

Last seen 10.2 years ago

Thanks for the responses. I used SAM by calling it on the exprsSet at was suggested. (It also worked on table made from a text file output from rma). Anyway when I did the analysis I got some 'interesting' results (see bottom of mail). Basically unpaired the data gave me 2500 genes delta > 2 with a FDR of 0.003, and paired (the U/T reps were conducted on the same plant batch) 4000 > 2 with FDR of 0.001. However if I permutate the input data (see ex. 3 below) then it just returns zeros. I guess this could be due to the coarseness of the comparisons as suggested but I'll just give a few more details of my data to see what people think. Basically the data is highly reproducable between biological replicates but there is a big 'treatment' effect (this is what we want!?). For example Rsq for within replicate x-y scatter plots of GCRMA data are 0.97 - 0.99, whilst for the 3 U-T comparisons the values are 0.92-0.93. So as I interpret things then as soon as you permutate the data you get very different data sets being mixed, massively increasing the variance and so few significant changes are detected, hence a very low FDR. If you input the data already permutated then some of the permutations of this data have loads of sig changes (as they represent the correct data order) and so FDR is huge and SAM returns all 0's. So where does this leave us, not using a test because the data is too 'good' seems abit strange. But equallly not knowing how reliable it is is also not good. Also on a more general note, when you get to this stage with so many changes (1 rep U-T comparison with GCRMA - 5000 1.5x and 2500 2x changes) is the data violating the assumption for the normalisation that most genes remain unchanged? I'll investigate the limma package. Thanks Matt > cl = c(0,0,0,1,1,1) > rmasam <- sam(rma, cl) SAM Analysis for the two class unpaired case. s0 = 0.0695 (The 15 % quantile of the s values.) SAM Analysis for a set of delta: delta p0 false called FDR 1 0.2 0.723 9638 13270 0.525 2 0.4 0.723 3951 9543 0.299 3 0.6 0.723 1634 7480 0.158 4 0.8 0.723 643 6068 0.077 5 1.0 0.723 286 5155 0.040 6 1.2 0.723 131 4394 0.022 7 1.4 0.723 64 3764 0.012 8 1.6 0.723 35 3259 0.008 9 1.8 0.723 18 2846 0.005 10 2.0 0.723 10 2478 0.003 > cl = c(1,2,3,-1,-2,-3) > rmasamp <- sam(rma, cl) SAM Analysis for the two class paired case. s0 = 0.0733 (The 45 % quantile of the s values.) SAM Analysis for a set of delta: delta p0 false called FDR 1 0.2 0.52 9631 17275 0.290 2 0.4 0.52 2378 13276 0.093 3 0.6 0.52 695 10684 0.034 4 0.8 0.52 257 8922 0.015 5 1.0 0.52 127 7664 0.009 6 1.2 0.52 53 6575 0.004 7 1.4 0.52 28 5652 0.003 8 1.6 0.52 14 4985 0.001 9 1.8 0.52 9 4370 0.001 10 2.0 0.52 5 3867 0.001 > cl = c(0,1,0,1,0,1) > rmasamperm <- sam(rma, cl) SAM Analysis for the two class unpaired case. s0 = 0.0549 (The 5 % quantile of the s values.) SAM Analysis for a set of delta: delta p0 false called FDR 1 0.2 1 0 0 0 2 0.4 1 0 0 0 3 0.6 1 0 0 0 4 0.8 1 0 0 0 5 1.0 1 0 0 0 6 1.2 1 0 0 0 7 1.4 1 0 0 0 8 1.6 1 0 0 0 9 1.8 1 0 0 0 10 2.0 1 0 0 0

ADD COMMENT • link 20.8 years ago Matthew Hannah ▴ 940

Login before adding your answer.