Question

limma, contrasts, and NA's

0

Entering edit mode

Albyn Jones ▴ 70

@albyn-jones-3850

Last seen 10.7 years ago

Following up on a thread I initiated last year on missing values in limma and contrasts.fit (see gmane.science.biology.informatics.conductor:26494 gmane.science.biology.informatics.conductor:26483) I have prepared an example illustrating the fact that when there are missing data, the contrast.fit function not only doesn't give exact results, but the results depend on the source chosen as the "reference" in modelMatrix(). When I pointed out to my colleagues in biology that they could manually construct a design matrix that would fit contrasts exactly, they rebelled at the need to construct N-1 linearly indpendent columns for N sources. The utility of contrasts.fit() is clear, avoiding the need for non-statisticians to learn to code design matrices. albyn -- Albyn Jones Reed College jones at reed.edu ====================================================================== === library(limma) ### necessary files and data are at: url <- "http://people.reed.edu/~renns/jones_renn/maternal_data/" targets <- readTargets(paste(url, "targets_fixed.csv", sep = ""), sep = ",") RG <- read.maimages(paste(url, targets$FileName, sep = ""), columns = list(R="F635 Median", G="F532 Median", Rb="B635 Median", Gb="B532 Median"), other.columns = c("Flags","B635 SD","B532 SD","F635 % Sat.","F532 % Sat.")) names(RG$other) <- c("Flags", "RbSD", "GbSD","RSat" ,"GSat" ) RG$genes <- readGAL(paste(url,"Fishchip4.03_annotatedGAL_20090730HEM.gal", sep = "")) RG$printer <- getLayout(RG$genes) ### Filter features manually flagged bad and automatedly flagged "not found" RG$R[RG$other$Flags < 0]<-NA RG$G[RG$other$Flags < 0]<-NA as.numericapplyis.na(RG$R),2,sum)) ### Filter faint features RG$R[(RG$R < (RG$Rb + 2*RG$other$RbSD))&(RG$G < (RG$Gb + 2*RG$other$GbSD))]<-0 RG$G[RG$R == 0]<-NA RG$R[RG$R == 0]<-NA as.numericapplyis.na(RG$R),2,sum)) tableapplyis.na(RG$R),1,sum)) ### Normalize MA<- normalizeWithinArrays(RG, method = "printtiploess", bc.method = "minimum") ### set up the design matrices design1<- modelMatrix(targets, ref = "a1126") design2<- modelMatrix(targets, ref = "a1217") ### contrast coding cont1 <- makeContrasts(WvsL = ( - a1111 - a1210 + a0201 + a1217 + a1220 - a1211 - a1219 + a1128 + a1202 - a1028 - a0128)/6, levels=design1) cont2 <- makeContrasts(WvsL = (a1126 - a1111 - a1210 + a0201 + a1220 - a1211 - a1219 + a1128 + a1202 - a1028 - a0128)/6, levels=design2) ### analysis fit1 <- lmFit (MA, design1) fit.cont1 <- contrasts.fit(fit1, cont1) fit.ebayes1 <- eBayes(fit.cont1) fit2 <- lmFit (MA, design2) fit.cont2 <- contrasts.fit(fit2, cont2) fit.ebayes2 <- eBayes(fit.cont2) results1 <- decideTests(fit.ebayes1, method = "separate", adjust.method="none", p.value=0.05) results2 <- decideTests(fit.ebayes2, method = "separate", adjust.method="none", p.value=0.05) res<-cbind(results1[,1], results2[,1]) colnames(res)<-c("ref1", "ref2") vennCounts(res, include="both") # ref1 ref2 Counts #[1,] 0 0 11862 #[2,] 0 1 30 #[3,] 1 0 100 #[4,] 1 1 814

limma limma • 1.3k views

ADD COMMENT • link updated 14.7 years ago by Gordon Smyth 52k • written 14.7 years ago by Albyn Jones ▴ 70

score 0 · Answer 1 · 2010-08-17

Dear Albyn, Many thanks for the very nice reproducible and complete data example. I've always agreed that spot-specific NA's do have an effect on the results from contrasts.fit(), and the results can change a bit with choice of reference. I will try to add some extra functionality to lmFit() so that users can extract contrasts at a stage when the genewise QR decompositions are still available, so all calculations can be done exactly. Thanks for sending me your version of code to do that. Let me come back though to the theme of whether there should be so many NAs in the data in the first place, and this is partly why I asked you for a reproducible example. In my experience, if there are enough NAs to cause problems for contrasts.fit(), then it is usually a symptom that too much spot flagging is being done. I hope you won't mind if I make some comments on the filtering in your example. I suspect that the filtering was probably imposed on you by your collaborators anyway. In your data example, nearly 40% of all spots on all arrays are being marked as NA. In my opinion, this is seriously over the top, unless the arrays are so bad as to be useless. I am very strongly opposed to the notion of filtering faint spots. Why remove perfectly good data just because it is a low value rather than a high value? I also don't have much confidence in GenePix flags. They are mainly a defense against things that we now know how to control. In my experience, biologists overestimate their ability to flag bad spots. On the other hand, filtering empty or non-expressed probes is a good idea, but whole probes should be removed, not isolated spots on individual arrays. Finally, filtering should be after the normalization rather than before. I would probably pre-process your data like this: library(limma) url <- "http://people.reed.edu/~renns/jones_renn/maternal_data" targets <- readTargets("targets_fixed.csv",path=url,sep=",") RG <- read.maimages(targets,path=url,source="genepix.median") RGb <- backgroundCorrect(RG,method="normexp",normexp.method="mle",offset=50) MA<- normalizeWithinArrays(RGb) Empty <- MA$genes$ID %in% c("empty","Empty") fit <- lmFit(MA[!Empty,], design) etc. Then there would be no missing values, and the whole issue of approximate calculation wouldn't arise. Best wishes Gordon --------- original message ------------ [BioC] limma, contrasts, and NA's Albyn Jones jones at reed.edu Fri Aug 13 20:42:14 CEST 2010 Following up on a thread I initiated last year on missing values in limma and contrasts.fit (see gmane.science.biology.informatics.conductor:26494 gmane.science.biology.informatics.conductor:26483) I have prepared an example illustrating the fact that when there are missing data, the contrast.fit function not only doesn't give exact results, but the results depend on the source chosen as the "reference" in modelMatrix(). When I pointed out to my colleagues in biology that they could manually construct a design matrix that would fit contrasts exactly, they rebelled at the need to construct N-1 linearly indpendent columns for N sources. The utility of contrasts.fit() is clear, avoiding the need for non-statisticians to learn to code design matrices. albyn -- Albyn Jones Reed College jones at reed.edu ====================================================================== === library(limma) ### necessary files and data are at: url <- "http://people.reed.edu/~renns/jones_renn/maternal_data/" targets <- readTargets(paste(url, "targets_fixed.csv", sep = ""), sep = ",") RG <- read.maimages(paste(url, targets$FileName, sep = ""), columns = list(R="F635 Median", G="F532 Median", Rb="B635 Median", Gb="B532 Median"), other.columns = c("Flags","B635 SD","B532 SD","F635 % Sat.","F532 % Sat.")) names(RG$other) <- c("Flags", "RbSD", "GbSD","RSat" ,"GSat" ) RG$genes <- readGAL(paste(url,"Fishchip4.03_annotatedGAL_20090730HEM.gal", sep = "")) RG$printer <- getLayout(RG$genes) ### Filter features manually flagged bad and automatedly flagged "not found" RG$R[RG$other$Flags < 0]<-NA RG$G[RG$other$Flags < 0]<-NA as.numericapplyis.na(RG$R),2,sum)) ### Filter faint features RG$R[(RG$R < (RG$Rb + 2*RG$other$RbSD))&(RG$G < (RG$Gb + 2*RG$other$GbSD))]<-0 RG$G[RG$R == 0]<-NA RG$R[RG$R == 0]<-NA as.numericapplyis.na(RG$R),2,sum)) tableapplyis.na(RG$R),1,sum)) ### Normalize MA<- normalizeWithinArrays(RG, method = "printtiploess", bc.method = "minimum") ### set up the design matrices design1<- modelMatrix(targets, ref = "a1126") design2<- modelMatrix(targets, ref = "a1217") ### contrast coding cont1 <- makeContrasts(WvsL = ( - a1111 - a1210 + a0201 + a1217 + a1220 - a1211 - a1219 + a1128 + a1202 - a1028 - a0128)/6, levels=design1) cont2 <- makeContrasts(WvsL = (a1126 - a1111 - a1210 + a0201 + a1220 - a1211 - a1219 + a1128 + a1202 - a1028 - a0128)/6, levels=design2) ### analysis fit1 <- lmFit (MA, design1) fit.cont1 <- contrasts.fit(fit1, cont1) fit.ebayes1 <- eBayes(fit.cont1) fit2 <- lmFit (MA, design2) fit.cont2 <- contrasts.fit(fit2, cont2) fit.ebayes2 <- eBayes(fit.cont2) results1 <- decideTests(fit.ebayes1, method = "separate", adjust.method="none", p.value=0.05) results2 <- decideTests(fit.ebayes2, method = "separate", adjust.method="none", p.value=0.05) res<-cbind(results1[,1], results2[,1]) colnames(res)<-c("ref1", "ref2") vennCounts(res, include="both") # ref1 ref2 Counts #[1,] 0 0 11862 #[2,] 0 1 30 #[3,] 1 0 100 #[4,] 1 1 814 ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}