Help: obtain Entrez ID's from limma top table with Affymetrix microarray (pd.zebgene.1.1.st)
1
1
Entering edit mode
mat149 ▴ 80
@mat149-11450
Last seen 6 days ago
United States

Hello,

I am working with Affymetrix 1.1 ST zebrafish gene arrays.  The script I have attached works but the pd.zebgene.1.1.st annotation package for these particular arrays do not contain Entrez ID's for each probeset on the platform.  I have seen that expression sets can be annotated using, for instance, library(affycoretools) and library(hugene11sttranscriptcluster.db).  My problem is that the zebrafish Affy 1.1 ST gene arrays do not contain transcriptcluster.db or probesetcluster.db packages that are available for the human/mouse/rat analogues.  My question is: Can anyone explain/provide a reference as to how I can obtain Entrez ID's for my limma topTable results? I fear that I am not good enough at "R" to create my own annotation packages for this array.

 

When I apply the "annotateEset" function, the Entrez ID's are not returned in my topTable results which look like:

PROBEID ID SYMBOL GENENAME logFC AveExpr t P.Value adj.P.Val B
13305218 AF109369 opn1mw1 opsin 1 (cone pigments), medium-wave-sensitive, 1 -6.1018 8.754473 -10.8657 1.08E-08 1.98E-05 10.13624
13187576 AF109373 opn1sw1 opsin 1 (cone pigments), short-wave-sensitive 1 -5.19865 8.99628 -9.02191 1.36E-07 9.19E-05 7.81309
13276068 NA NA NA -4.76802 7.529454 -7.27885 2.11E-06 0.00053 5.207056
12988871 BC134193 gngt2b guanine nucleotide binding protein (G protein), gamma transducing activity polypeptide 2b -4.75024 7.834161 -11.5697 4.49E-09 1.21E-05 10.92152
13282126 BC059650 arr3b arrestin 3b, retinal (X-arrestin) -4.297 6.427947 -6.55812 7.38E-06 0.001151 3.99979

 

 

setwd("C:\\Users\\mat149\\Desktop\\GG")
library(oligo)
CELlist <- list.celfiles("C:\\Users\\mat149\\Desktop\\GG\\CEL", full.names=TRUE, pattern=NULL, all.file=FALSE)
pdat<-read.AnnotatedDataFrame("C:\\Users\\mat149\\Desktop\\GG\\CEL\\pheno16.txt",header=TRUE,row.name="Filename",sep="\t")
CELdat <- read.celfiles(filenames = CELlist,experimentData=TRUE,phenoData=pdat,verbose=TRUE)

 

ph = CELdat@phenoData 
ph@data[ ,1] = c("WT1","WT2","WT3","WT4","WT5","WT6","WT7","WT8","lepA_MO1","lepA_MO2","lepA_MO3","lepA_MO4","lepA_rescue1","lepA_rescue2","lepA_rescue3","lepA_rescue4")
ph@data[ ,2] = c("CONTROL","CONTROL","CONTROL","CONTROL","CONTROL","CONTROL","CONTROL","CONTROL","morphant","morphant","morphant","morphant","rescue","rescue","rescue","rescue")
colnames(ph@data)[2]="Treatment"
colnames(ph@data)[1]="Sample"
groups = ph@data$Treatment
f = factor(groups,levels=c("CONTROL","morphant","rescue"))

 

eset<-rma(CELdat, background=TRUE, normalize=TRUE, subset=NULL, target="core")     
featureData(eset)<-getNetAffx(eset, type = "transcript")
data.matrix = exprs(eset)
library(affycoretools)
librarypd.zebgene.1.1.st)
eset.annot <- annotateEset(eset, annotation(eset))
estt <- annotateEset(eset.annot, pd.zebgene.1.1.st,columns = c("PROBEID","ENSEMBL","ALIAS","ENTREZID","SYMBOL","GENENAME"),multivals = filter,annocols=1, probecol=2)

 

library(limma)
design <- model.matrix(~ 0+factor(c(1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,3)))
colnames(design) <- c("CONTROL","lepA_MO","lepA_RS")
fit <- lmFit(estt, design)
contrast.matrix <- makeContrasts(lepA_MO-CONTROL, lepA_RS-CONTROL, lepA_MO-lepA_RS, levels=design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)
results <- decideTests(fit2,p.value=0.01,adjust="BH",lfc=2)

MOWT<-topTable(fit2, coef=1,adjust="BH",p.value=0.01,lfc=2,sort.by="P",n=Inf)

 

 

 

 

 

 

 

affmetrix microarray limma oligo • 2.0k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

Or you could use mapIds directly

fd <- fData(eset)

fd$ENTREZID <- mapIds(org.Dr.eg.db, fd$SYMBOL, "ENTREZID","SYMBOL")

fData(eset) <- fd

and then proceed.

ADD COMMENT
0
Entering edit mode

Thanks for the reply.  I have added your script but I am still getting an error message:

 

"Error in `$<-.data.frame`(`*tmp*`, "ENTREZID", value = list(`30037` = 14730L,  : 
  replacement has 37237 rows, data has 75212"

 

The script I ran was:

setwd("C:\\Users\\mat149\\Desktop\\GG")
library(oligo)
CELlist <- list.celfiles("C:\\Users\\mat149\\Desktop\\GG\\CEL", full.names=TRUE, pattern=NULL, all.file=FALSE)
pdat<-read.AnnotatedDataFrame("C:\\Users\\mat149\\Desktop\\GG\\CEL\\pheno16.txt",header=TRUE,row.name="Filename",sep="\t")
CELdat <- read.celfiles(filenames = CELlist,experimentData=TRUE,phenoData=pdat,verbose=TRUE)

eset<-rma(CELdat, background=TRUE, normalize=TRUE, subset=NULL, target="core")     
featureData(eset)<-getNetAffx(eset, type = "transcript")
data.matrix = exprs(eset)
library(affycoretools)
librarypd.zebgene.1.1.st)
eset.annot <- annotateEset(eset, annotation(eset))
eset1 <- annotateEset(eset.annot, pd.zebgene.1.1.st,columns = c("PROBEID","ZFIN","ENSEMBL","ALIAS","ENTREZID","SYMBOL","GENENAME"),multiVals = filter,annocols=1, probecol=2)


library(org.Dr.eg.db)
keys <- keys(org.Dr.eg.db)
fd <- fData(eset1)
fd$ENTREZID <- mapIds(org.Dr.eg.db,keys=keys,fd$SYMBOL,column="ENTREZID",keytype="ENTREZID", multiVals = list) -----> error after this line
fData(eset) <- fd

 

 

 

 

 

 

ADD REPLY
0
Entering edit mode

You are doing a bunch of extra things that don't really help your cause. This:

featureData(eset)<-getNetAffx(eset, type = "transcript")

and this

eset.annot <- annotateEset(eset, annotation(eset))

do very similar things. Is there a particular rationale to do both? And then you do this:

eset1 <- annotateEset(eset.annot, pd.zebgene.1.1.st,columns = c("PROBEID","ZFIN","ENSEMBL","ALIAS","ENTREZID","SYMBOL","GENENAME"),multiVals = filter,annocols=1, probecol=2)

which is a mixture of arguments for two different methods, and probably doesn't do what you think it is doing. In the end you have three (!) different ExpressionSets that are annotated using three closely related methods. You can really only use one at a time, so having three is unnecessarily complex.

The final error comes from the fact that you are doing something that makes no sense at all, and doesn't even remotely resemble what I recommended you do. You are trying to find the Entrez Gene ID for all the genes on your array, right? But then you do this:

keys <- keys(org.Dr.eg.db)

That gets all the Entrez Gene IDs in the org.Dr.eg.db package, which corresponds to all known D. rerio Entrez Gene IDs. You then do

mapIds(org.Dr.eg.db,keys=keys,fd$SYMBOL,column="ENTREZID",keytype="ENTREZID", multiVals = list)

Which, translated to English is saying 'give me all the Entrez Gene IDs for D. rerio that are in the org.Dr.eg.db, based on this set of Entrez Gene IDs that I just extracted from that package'. The fact that you put fd$SYMBOL in that function call, without using an argument, means that it is in essence ignored.

You can use positional argument matching, where you put the arguments in the correct position, or you can use direct argument matching with the equals sign, but using a mixture of both will often have unintended consequences.

I gave you three lines of code that would have done exactly what you wanted to do, but you didn't use it! This doesn't have to be as complicated as you are making it, and you should NEVER be accessing S4 slots directly using the @ symbol! Rather than doing all this extra stuff that is not helpful, you can do what you want in very few lines of code.

This is what your script should look like:

setwd("C:\\Users\\mat149\\Desktop\\GG")
library(oligo)
CELdat <- read.celfiles(list.celfiles())
eset<-rma(CELdat)
library(affycoretools)
eset <- annotateEset(eset, annotation(eset))
## extra annotation
fd <- fData(eset)
fd$ENTREZID <- mapIds(org.Dr.eg.db, fd$SYMBOL, "ENTREZID","SYMBOL")
fData(eset) <- fd
## modeling
library(limma)
grps <- factor(rep(c("WT","lepA_MO","lepA_rescue"), c(8,4,4)))
design <- model.matrix(~ 0+grps)
colnames(design) <- gsub("grps", "", colnames(design))
fit <- lmFit(eset, design)
contrast.matrix <- makeContrasts(lepA_MO-CONTROL, lepA_RS-CONTROL, lepA_MO-lepA_RS, levels=design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)
results <- decideTests(fit2,p.value=0.01,adjust="BH",lfc=2)

MOWT<-topTable(fit2, coef=1, adjust="BH", p.value=0.01, lfc=2, sort.by="P", n=Inf)

Note that adding in all the phenoData isn't strictly necessary, ESPECIALLY if you add it all in and then never use it for anything! Your goal should be to simplify what you do, limit extra steps, and understand what you are doing and why.

ADD REPLY
0
Entering edit mode

Hello James, thanks again for your help. I am not a computer scientist by any means so forgive my ignorance.  I have followed the lines of code you have listed but I ran into this error when I tried to run the line: 

fd$ENTREZID <- mapIds(org.Dr.eg.db, fd$SYMBOL, "ENTREZID","SYMBOL")

Error in .testForValidKeys(x, keys, keytype, fks) : 
  'keys' must be a character vector

I think I need to tell AnnotationDbi which keys/keytype from the expression set feature data to map to Entrez ID's available in org.Dre.eg.db, but I dont know how to do that (the eset contains "PROBEID" identifiers in the first column).  To answer your question, yes I am trying to obtain Entrez ID's for all genes on the array.  

 

ADD REPLY
0
Entering edit mode

You are misinterpreting the problem. The error you get is

Error in .testForValidKeys(x, keys, keytype, fks) : 
  'keys' must be a character vector

Which is telling you that the 'keys' argument needs to be a character vector. If we look at the arguments for mapIds, we have:

> args(mapIds)
function (x, keys, column, keytype, ..., multiVals)

So the 'keys' argument is the second argument, for which you are passing fd$SYMBOL. And the error message is telling you that this must be character. You could hypothetically test to see what class that is, using class(fd$SYMBOL) (it's a factor), but that won't tell you how to 'fix' the situation. BUT Google will, and Google is almost always your best friend! Part of being an informed programmer is knowing how to answer questions. A simple Google search for 'R convert factor to character' will turn up any number of links that will tell you to use as.character.

At which point you will know that the answer is

fd$ENTREZID <- mapIds(org.Dr.eg.db, as.character(fd$SYMBOL), "ENTREZID","SYMBOL")
ADD REPLY
0
Entering edit mode

Thank you very much; this worked and I also learned a few neat things in the process of you explaining my fails.

Vouch for James!

ADD REPLY

Login before adding your answer.

Traffic: 623 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6