Do I map probes to gene symbols BEFORE or AFTER differentially expressed gene analysis?
1
0
Entering edit mode
JackieMe • 0
@jackieme-14035
Last seen 5.3 years ago

I am learning to do differentially expressed gene analysis on microarray data(typically Affaymetrix GeneChip Human Genome U133 Plus 2.0 Array) of tumor vs normal. I've read some tutorials and know the basic steps. But I never saw people talk about when exactly to annotate the probes to gene symbols.
I now do DEG analysis on probe level and map probes to gene symbols after I get a list of differentially expressed probes. However, many differentially expressed probes map to the same gene, which resluts in much less DEGs than differentially expressed probes. I've also seen poeple map probes to gene symbols before DEG analysis, exactly after they get eset( eset <- exprs(rma(affyBatch)) ) . I am really confused. 
I give the briefing of my code below:

library(GEOquery)
library(affycoretools)
library(limma)
library(dplyr)

gse54129 <- getGEO('GSE54129', destdir = '../GSE/series_matrix', getGPL = FALSE)
pdata <- pData(phenoData(gse54129[[1]]))
group <- ifelse(pdata$characteristics_ch1 == 'tissue: normal', 'Normal', 'Tumor')
group <- factor(group, levels=c('Normal', 'Tumor'))
table(group)
# Normal  Tumor 
#     21    111
###### load CEL files #####
library(affy)
files <- paste('../GSE/CEL/GSE54129/', list.files('../GSE/CEL/GSE54129/', pattern = 'CEL.gz'), sep = '')
affyBatch <- ReadAffy(filenames=files)
sampleNames(affyBatch)
sampleNames(affyBatch) <- substr(sampleNames(affyBatch), 1, 10)
sampleNames(affyBatch)

data.rma <- rma(affyBatch)

eset <- exprs(data.rma)

plotPCA(eset, addtext=colnames(eset), groups=factor(pdata$characteristics_ch1), legend=FALSE)

##### DEG #######
mode(eset)
# [1] "numeric"

design <- model.matrix(~group)

fit <- lmFit(eset, design)

fit2 <- eBayes(fit)
topTable(fit2, coef = 2, adjust.method = 'BH')
res <- decideTests(fit2, lfc = 2)
summary(res)
#    (Intercept) groupTumor
# -1           0        554
# 0            0      53609
# 1        54675        512

deg.54129 <- topTable(fit2, coef = 2, number = Inf, adjust.method = 'BH')
deg.54129$gene <- rownames(deg.54129)

deg.54129.sig2 <- filter(deg.54129, adj.P.Val < 0.05 & abs(logFC) >= 2)
dim(deg.54129.sig2)
# [1] 1066    7

library(hgu133plus2.db)
probes <- deg.54129.sig2$gene
gene.symbol <- as.character(as.list(hgu133plus2SYMBOL[probes]))
gene.entrez <- as.character(as.list(hgu133plus2ENTREZID[probes]))

deg.54129.sig2 <- cbind(PROBE=deg.54129.sig2$gene, SYMBOL=gene.symbol, ENTREZID=gene.entrez,
                        deg.54129.sig2[, -ncol(deg.54129.sig2)])
deg.54129.sig2 <- arrange(deg.54129.sig2, SYMBOL, desc(abs(logFC)))
deg.54129.sig2 <- deg.54129.sig2[!duplicated(deg.54129.sig2$SYMBOL), ]
dim(deg.54129.sig2)
# [1] 729   9
save(deg.54129.sig2, file = 'gse54129.deg.probe_level.rda')
limma deg microarray • 1.7k views
ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 2 hours ago
WEHI, Melbourne, Australia

It doesn't make any difference whether you map to probes before the DE analysis or after, although I myself would always do it before.

You could follow the example case study in Section 17.2 of the limma User's Guide. The example is for hgu95 affy arrays instead of u133plus2 but the code should be pretty straightforward to adapt.

ADD COMMENT
0
Entering edit mode

Thank you Gordon! I am sorry if I am asking some strange questions.

In addition to this, multiple mapping ( dual probes mapping to a gene ) is common in hgu133plus array. I've seen some posts online said that to replace the expression values of the probes with thier mean/max expression values.    

Is there a recommended way to deal with this ? And again, should I deal this this  BEFORE or AFTER differentially expressed gene analysis?  If possible, can you show me some example code or recommend a package to do this? 
 

ADD REPLY
0
Entering edit mode

I usually don't see any need to do anything about multiple probes for a given gene. If however there really is a need to keep just one probe for each gene, then I do it like described here: Duplicate gene ID's returned from limma microarray analysis

ADD REPLY
0
Entering edit mode

Thanks! That's exactly what I am looking for.

ADD REPLY

Login before adding your answer.

Traffic: 784 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6