How to annotate the [MoGene-2_0-st] Affymetrix Mouse Gene 2.0 ST Array chip
llkxiaolan ▴ 10

I am a Chinese student and my English is not very good, so some parts may be unclear. I hope you can understand.

Recently, I've been using the oligo package to analyze Affymetrix Mouse Gene 2.0 ST Array chips, but I have not been able to convert the probe IDs into gene IDs. This problem has been bothering me for a long time. I looked up some information but could not solve it. Can anyone help me? Thank you very much.

Here's the code I'm using:

library(oligo)

celFiles <- list.celfiles()

affyRaw <- read.celfiles(celFiles)

library(pd.mogene.2.0.st)

eset <- rma(affyRaw)

library(limma)

design <- model.matrix(~ 0+factor(c(1,1,1,2,2,2)))
colnames(design) <- c("group1", "group2")
contrast.matrix <- makeContrasts(contrasts="group2-group1",levels=design)
design
fit <- lmFit(eset, design)
fit1<- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit1)

dif<-topTable(fit2,coef="group2-group1",n=nrow(fit2),lfc=log2(2))
dif<-dif[dif[,"adj.P.Val"]<0.05,]
head(dif)

 

This is as far as I can get. I cannot work out how to do the ID conversion. Can anyone help me? Thank you again.

Guido Hooiveld ★ 4.1k
Wageningen University, Wageningen, the …

The most convenient approach is the function annotateEset() from the package affycoretools. Use your normalized object eset as input. You can then annotate your dataset using either the corresponding PdInfo package or the ChipDb package.

The annotation info available in the PdInfo package is basically a 1:1 copy of the info made available by Affymetrix on their support pages (e.g. in the file MoGene-2_0-st-v1.na36.mm10.transcript.csv). The latter (ChipDb) is generated entirely using the Bioconductor infrastructure; only the probeset -> gene ID mapping is extracted from the aforementioned csv file. Thus:

library(affycoretools)

# using the PdInfo package
eset.anno1 <- annotateEset(eset, pd.mogene.2.0.st)

# using the ChipDb package
library(mogene20sttranscriptcluster.db)
eset.anno2 <- annotateEset(eset, mogene20sttranscriptcluster.db)

 

Then continue with the analysis in limma using the annotated object (eset.anno1 or eset.anno2); the annotation info will be automagically added to the limma output.
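As a rough illustration of what such an annotation step amounts to, here is a minimal base R sketch of a probeset-to-gene lookup. It is not how annotateEset() is implemented; the lookup table `map` and the vector `probes` are hypothetical stand-ins, filled with a few values that appear in this thread.

```r
# Hypothetical probeset -> gene lookup table (toy stand-in for an annotation package).
map <- data.frame(
  PROBEID = c("17278777", "17507910"),
  SYMBOL  = c("DQ267102", "Defa-rs1"),
  stringsAsFactors = FALSE
)

# Probeset IDs as they would appear in the row names of the expression matrix.
probes <- c("17203807", "17278777", "17507910")

# match() returns NA for probesets that have no entry in the lookup table,
# so unannotated probesets end up with NA in the SYMBOL column.
annotated <- data.frame(
  PROBEID = probes,
  SYMBOL  = map$SYMBOL[match(probes, map$PROBEID)],
  stringsAsFactors = FALSE
)
print(annotated)
```

A probeset without an entry in the lookup table simply gets NA, so NA values in annotated output reflect missing source annotation rather than a failed conversion.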

 

 


First of all, thank you very much for your answer. I followed your method, but the problem does not seem to be solved: there are many NA values, and I don't know what causes them.

          PROBEID   ID         SYMBOL    GENENAME
17203807  17203807  NA         NA        NA
17201831  17201831  NA         NA        NA
17278777  17278777  NR_046306  DQ267102  snoRNA DQ267102
17207623  17207623  NA         NA        NA
17507910  17507910  NM_007844  Defa-rs1  defensin, alpha, related sequence 1
17207769  17207769  NA         NA        NA
17202349  17202349  NA         NA        NA
17205531  17205531  NA         NA        NA
17548311  17548311  AK002956   Edv       endogenous sequence related to the Duplan murine retrovirus
17548313  17548313  AK002956   Edv       endogenous sequence related to the Duplan murine retrovirus
17548642  17548642  AK002956   Edv       endogenous sequence related to the Duplan murine retrovirus
17548644  17548644  AK002956   Edv       endogenous sequence related to the Duplan murine retrovirus
17357560  17357560  NA         NA        NA

          logFC     AveExpr   t         P.Value   adj.P.Val  B
17203807  -0.63256  1.160775  -8.28233  2.47E-05  0.6607     -2.37527
17201831  -0.81393  3.886475  -7.03522  8.36E-05  0.6607     -2.50944
17278777  0.848883  2.678946  6.954444  9.10E-05  0.6607     -2.52004
17207623  -1.18329  2.450527  -6.82897  0.000104  0.6607     -2.53705
17507910  0.699768  5.049872  6.689652  0.000121  0.6607     -2.55676
17207769  0.940615  2.023429  6.606085  0.000132  0.6607     -2.56902
17202349  0.838046  5.041423  6.378744  0.00017   0.6607     -2.6041
17205531  1.06037   2.07063   6.301779  0.000185  0.6607     -2.61658
17548311  0.522954  10.88666  6.270509  0.000192  0.6607     -2.62174
17548313  0.522954  10.88666  6.270509  0.000192  0.6607     -2.62174
17548642  0.522954  10.88666  6.270509  0.000192  0.6607     -2.62174
17548644  0.522954  10.88666  6.270509  0.000192  0.6607     -2.62174
17357560  -2.66352  3.567527  -6.10183  0.000232  0.708912   -2.65052

 


Well, I don't fully agree with you. Your annotation 'problem' HAS been solved, because SYMBOLs and GENENAMEs were retrieved and added to your output. I agree that many NAs are present. However, this is (solely) due to the limited annotation information Affymetrix provides for this array. In other words, you have to 'blame' Affymetrix for providing such a poorly annotated csv file... (which is the basis of all annotation files).

In the thread "affycoretools annotateEset problem using Clariom D arrays", James MacDonald provides an informative line of code that will show you the fraction of your data that could be annotated:

apply(fData(eset.anno2), 2, function(x) sum(!is.na(x))/length(x))
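To make clear what that line computes, here is a small base R sketch that applies the same expression to a toy feature table. The data frame `fdat` is a hypothetical stand-in for fData(eset.anno2), filled with four rows taken from the output posted earlier in this thread.

```r
# Toy feature table standing in for fData(eset.anno2).
fdat <- data.frame(
  PROBEID  = c("17203807", "17278777", "17507910", "17207623"),
  ID       = c(NA, "NR_046306", "NM_007844", NA),
  SYMBOL   = c(NA, "DQ267102", "Defa-rs1", NA),
  GENENAME = c(NA, "snoRNA DQ267102", "defensin, alpha, related sequence 1", NA),
  stringsAsFactors = FALSE
)

# For each column: number of non-missing entries divided by the number of rows.
annotated_fraction <- apply(fdat, 2, function(x) sum(!is.na(x)) / length(x))
print(annotated_fraction)
```

In this toy table every probeset has a PROBEID (fraction 1), while only two of the four rows carry an ID, SYMBOL and GENENAME (fraction 0.5 each).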

 

To reduce the number of unannotated probe IDs, you might consider using the so-called custom-defined array definitions made by Manhong Dai from the Brain Array group. Manhong remaps all probes present on the array to a current genome build available at e.g. the NCBI or ENSEMBL databases. In addition to filtering out probes that are not specific, another advantage is that (almost) all probe IDs are annotated. If you would like to go that way, below is some code to get you started (note: this code uses the remapped probes based on the ENTREZG database from NCBI):

#Install required packages, assuming you are using Windows
install.packages("http://mbni.org/customcdf/22.0.0/entrezg.download/pd.mogene20st.mm.entrezg_22.0.0.zip", repos = NULL)
install.packages("http://mbni.org/customcdf/22.0.0/entrezg.download/mogene20stmmentrezg.db_22.0.0.zip", repos = NULL)

library(pd.mogene20st.mm.entrezg)
celFiles <- list.celfiles()
affyRaw <- read.celfiles(celFiles, pkgname = "pd.mogene20st.mm.entrezg")
eset <- rma(affyRaw)

library(mogene20stmmentrezg.db)
eset.anno3 <- annotateEset(eset, mogene20stmmentrezg.db)

 


Thank you very much for your reply. I'll take a closer look at it. Thank you very much 


Sorry, there's another question I'd like to ask you. I used the code above to annotate the data, but there are some small problems in the result.

          PROBEID   ID                  SYMBOL        GENENAME
17210850  17210850  ENSMUST00000082908  Gm26206       predicted gene, 26206
17210852  17210852  XR_398539           LOC102640548  uncharacterized LOC102640548
17210855  17210855  NM_008866           Lypla1        lysophospholipase 1
17210869  17210869  NM_001159750        Tcea1         transcription elongation factor A (SII) 1
17210883  17210883  XR_373197           LOC102631647  uncharacterized LOC102631647
17210887  17210887  NM_133826           Atp6v1h       ATPase, H+ transporting, lysosomal V1 subunit H

          logFC     AveExpr   t         P.Value   adj.P.Val  B
17210850  0.018637  1.100376  0.180266  0.861197  0.996585   -4.9008
17210852  -0.02858  1.205122  -0.20729  0.840699  0.995097   -4.89808
17210855  0.008326  9.665614  0.050704  0.960741  0.998727   -4.90858
17210869  0.210958  8.376269  1.464098  0.179386  0.969223   -4.4408
17210883  0.08286   2.004754  0.841665  0.423177  0.972998   -4.7356
17210887  0.01326   7.819807  0.180675  0.860886  0.996585   -4.90076

 


What does XR_398539 mean in the ID column? Also, some genes have annotated names in this result but have no name in the GPL annotation file. What is the reason?

Sorry, my English is not very good. Is my description of the problem understandable?


Mmm, you also need to explore things yourself a bit...

XR is one of the 9 RefSeq annotation categories; the abbreviation XR is used to describe a 'predicted ncRNA model' that has been given the (numerical) ID 398539. Please note that this is a computational prediction, so there is (as yet) no experimental evidence that this gene (model) exists. See also: https://en.wikipedia.org/wiki/RefSeq

Regarding the absence of info in the GPL annotation file: I think this has to do with the fact that the annotation info at GEO was last updated in 2013 (Jan 30, 2013: annotation table updated with netaffx build 33), whereas the PdInfo package has been created with the latest Affymetrix information available, which is from January 2017 (netaffx build 36). In other words, the annotation info available at GEO is outdated.
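As a small aid for reading the ID column, the following base R sketch looks up the prefix of an accession in a short table of RefSeq categories. The helper describe_accession() is hypothetical (it is not part of any package), and the lookup covers only a subset of the RefSeq prefixes, with descriptions paraphrased.

```r
# Subset of RefSeq accession prefixes; not an exhaustive list.
refseq_prefix <- c(
  NM = "mRNA",
  NR = "ncRNA",
  NP = "protein",
  XM = "predicted mRNA model",
  XR = "predicted ncRNA model",
  XP = "predicted protein model"
)

# Hypothetical helper: describe an accession by the text before its first underscore.
describe_accession <- function(id) {
  prefix <- sub("_.*$", "", id)
  out <- unname(refseq_prefix[prefix])
  # Anything not found in the lookup (e.g. Ensembl or GenBank IDs) is flagged as such.
  out[is.na(out)] <- "not a RefSeq prefix in this lookup"
  out
}

describe_accession(c("XR_398539", "NM_008866", "ENSMUST00000082908", "AK002956"))
```

IDs such as ENSMUST00000082908 or AK002956 are not RefSeq accessions, so the helper reports them as not found rather than guessing a category.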

 


Thank you very much for your answer. I am teaching myself bioinformatics, and the teachers and students at my school do not know the field well, so there are many problems I cannot solve on my own and I can only ask for help online. I will find some material to study. Thank you very much for your help.
