Dear Community,
I'm relatively new to R and i have recently processed and analyzed through limma an Agilent-014850 Whole Human Genome Microarray 4x44K G4112F (Probe Name version) dataset, concerning some statistical comparisons. My basic code is presented below:
SDRF <- read.delim("GSE12435.sdrf.txt",check.names=FALSE,stringsAsFactors=FALSE) files <- list.files(pattern = "GSM") dat <- read.maimages(files, source="agilent", green.only=T) pdat <- read.table("GSE12435.sdrf.txt", sep="\t",header=TRUE) pdat2 <- pdat[,2:3] dat$targets <- pdat2 y <- backgroundCorrect(dat,method="normexp") y <- normalizeBetweenArrays(y,method="quantile") rownames(y)<-y$genes$ProbeName control.index <- which(y$genes$ControlType %in% c("1", "-1")) y2 <- y[-control.index,] # to remove control probes head(rownames(y2)) [1] "A_24_P66027" "A_32_P77178" "A_23_P212522" "A_24_P9671" "A_24_P801451" "A_32_P30710" ind <- rowSums(y2$E > 6) >=4 y2 <- y2[ind,] # some naive basic intensity filtering group <- factor(y2$targets$Sample.and.Data.Relationship.Format,levels=c("control","bystander 4h","irradiated 4h")) batch <- factor(y2$targets$Batch) design <- model.matrix(~group + batch) fit <- lmFit(y2, design) fit2 <- eBayes(fit, trend=TRUE) top2 <- toptable(fit2, coef=2, number=nrow(fit2), adjust.method = "fdr", sort="none") head(top2) ID logFC t P.Value adj.P.Val B 1 A_24_P66027 0.18871738 1.0400444 0.32212991 0.6150887 -5.778509 2 A_32_P77178 -0.17539127 -0.6735214 0.51545197 0.7570036 -6.088395 3 A_23_P212522 0.41854183 2.5192718 0.02981977 0.2639606 -3.720464 4 A_24_P9671 0.32246171 2.3800601 0.03794132 0.2854077 -3.945223 5 A_24_P801451 -0.58180526 -3.1227178 0.01046536 0.2044593 -2.726698 6 A_32_P30710 -0.02033664 -0.1193851 0.90726804 0.9595242 -6.316518 prob.names <- as.character(top2$ID) gns <- select(hgug4112a.db, prob.names,columns=c("SYMBOL")) 'select()' returned many:1 mapping between keys and columns dim(gns) [1] 28592 2 # Thus, there is not something wrong here ? in detail, when i type sum(duplicated(gns$PROBEID)) [1] 2007 # whereas dim(y2) [1] 28592 12
In other words, with the many:1 mapping the resulted gns dataframe should not have more rows that the length of my probe.sets-probeIDs of my expression set ?
Thank you,
Konstantinos
Dear Gordon,
sorry to reply with delay and thank you for your response !! but the annotation frame after using the function select() should not have more rows ? due to the duplicate PROBEIDs i posted above-- ? l
like also in the following example in the vignette
(http://bioconductor.org/packages/release/bioc/vignettes/affycoretools/inst/doc/RefactoredAffycoretools.pdf)
Thank you !!
Konstadinos
You are incorrectly interpreting the message. You have duplicated probes on the array, and when you ask for the HUGO symbol, the select function is telling you that you are doing a many:1 mapping. But this doesn't mean that you should have different numbers of rows. In fact, for a many:1 mapping, you are guaranteed to have the same number of rows, as well as the same order. As an example:
And to further illustrate my point:
So if I ask for the symbol for the same probe ID five times, I am returned a five-row data.frame, with a message that I have a many:1 mapping (five probe IDs -> one HUGO symbol).