What is the common workflow for building a microarray annotation package such as hgu133a.db?
For some arrays the probe sequences are available, so presumably the probes are mapped to genes by sequence. How are the other cases handled? If the code used by the team is available, that would be great. Thank you.
The specific goal is to build annotation packages for platforms that are not currently available from Bioconductor (all I need is the mapping from probes to gene symbols).
It seems Bioconductor updates the annotation packages with each new release because gene symbols change.
BTW, why is the package named hgu133a.db rather than GPL96.db (after the GEO platform) in Bioconductor? Users have to find the correspondence between the two themselves, although some mappings exist, such as https://gist.github.com/seandavi/bc6b1b82dc65c47510c7#file-platformmap-txt.
Thank you. I will check the AnnotationForge package.

About the naming, another example is `hgug4112a` to Agilent-012391 Whole Human Genome Oligo Microarray G4112A (Feature Number version). It would be awkward to find the correspondence between them without the gist file supplied by seandavi, which is also incomplete. Usually, this kind of annotation package is used for annotating the GPLs. Are there other uses for the annotation packages?

Not really. A typical case study would be reading Affymetrix CEL files with `ReadAffy()` (`affy` package), followed by `rma()`, which returns an `ExpressionSet` object. This automatically detects the correct annotation package. GEO is a project independent of Bioconductor (so there is no guarantee that Bioconductor has annotation packages matching all platforms available at GEO). Of course, you may use the annotation packages in Bioconductor to annotate the arrays in GEO, but I think that is an extra benefit, not the original motivation. BTW, there is a way to obtain the existing correspondence between GPLs and Bioconductor annotation packages that may (or may not) be more up to date (I got this from Sean Davis: dplyr and the GEOmetadb package for mining NCBI GEO metadata):

`ReadAffy()`, which works with Affymetrix arrays only, returns an `AffyBatch` object whose annotation name, such as `hgu133plus2`, is available through the function `annotation()`. But the `featureData` will be null (why is it null? It suggests the annotation package information is not picked up automatically). If `getGEO()` is used instead, `annotation()` returns GPLxx, and `featureData()` comes from GPLxx in GEO and is not empty. Thank you for showing the GPL-to-annotation-package correspondence file.

update: `getGEO` will get the annotation package information if the parameter `AnnotGPL` is `TRUE` and the package exists.

update2: `getGEO` will get the updated annotation information if the parameter `AnnotGPL` is `TRUE` and the annotation file exists, like GPL570.annot.gz.

I was missing the `rma()` step; corrected. It is NULL because by default it does not contain feature information (you can add it from the annotation package if you wish). `getGEO()` was developed later and gets its information from GEO, hence it contains the associated platform and some featureData that is available at GEO. But there is no direct translation, as you said. Using the info I gave you (or the link to the gist) you can find the correspondence for existing packages. If the package is not there, you may want to build one yourself with the AnnotationForge package.

About AnnotationForge: I have the probe names and sequences (for example, for GPL6480). I do not want to use the annotation from GEO; I want to update the annotation myself. However, AnnotationForge requires some kind of ID (such as a GenBank ID) alongside the probe name, so it seems mapping is the first step. Which functions in R are usually used to map probes to genes/miRNAs?
You could map the probes to the genome using the `Biostrings` package and then annotate them by overlap with transcripts. I did that for some arrays a few months ago. I will post a script later if I have some time... I am not sure whether the recipe is also available in some vignette or in a post on the support site.
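
In the meantime, here is a minimal sketch of that idea. It is not the script mentioned above, just an illustration under stated assumptions. The reference sequence, the probe IDs (P1 to P3), the sequence name `chrToy`, the gene symbols, and the helper `find_hits()` are all made up for the example. It looks for exact matches of each probe on both strands with `matchPattern()` and then assigns symbols with a strand-aware `findOverlaps()` against toy transcript ranges. Whether probes should match the sense or the antisense strand of the transcript depends on the platform, so treat the strand handling as an assumption as well.

```r
library(Biostrings)
library(GenomicRanges)

# Toy reference standing in for one chromosome (hypothetical data).
chrom <- DNAString("ACGTTGCATGCCGATAGCTAGGCTTAACGGATCCATGCAAGT")

# Toy probe set: names are probe IDs (hypothetical), values are probe sequences.
probes <- DNAStringSet(c(P1 = "GCCGATAGCT",   # matches the forward strand
                         P2 = "GGATCCGTTA",   # matches the reverse strand
                         P3 = "TTTTTTTTTT"))  # matches nowhere

# Hypothetical helper: exact matches of every probe on both strands.
find_hits <- function(probes, subject) {
  per_probe <- lapply(seq_along(probes), function(i) {
    fwd <- matchPattern(probes[[i]], subject)
    rc  <- matchPattern(reverseComplement(probes[[i]]), subject)
    n   <- length(fwd) + length(rc)
    data.frame(probe  = rep(names(probes)[i], n),
               start  = c(start(fwd), start(rc)),
               end    = c(end(fwd), end(rc)),
               strand = c(rep("+", length(fwd)), rep("-", length(rc))),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, per_probe)
}

hits <- find_hits(probes, chrom)
hit_gr <- GRanges(seqnames = "chrToy",
                  ranges   = IRanges(hits$start, hits$end),
                  strand   = hits$strand,
                  probe    = hits$probe)

# Toy transcript annotation with gene symbols (hypothetical).
tx <- GRanges(seqnames = "chrToy",
              ranges   = IRanges(start = c(5, 20), end = c(22, 40)),
              strand   = c("+", "-"),
              symbol   = c("GENEA", "GENEB"))

# Strand-aware overlap: a probe gets the symbol of each transcript it falls in.
ov <- findOverlaps(hit_gr, tx)
probe2symbol <- unique(data.frame(probe  = hit_gr$probe[queryHits(ov)],
                                  symbol = tx$symbol[subjectHits(ov)],
                                  stringsAsFactors = FALSE))
print(probe2symbol)
```

For a real array you would presumably replace the toy reference with a genome (for example from a BSgenome package) and the toy transcripts with ranges from a transcript annotation, and probes with several hits or with none need a decision on how to handle them. Exact matching against the genome will also miss probes that span exon junctions, so matching against transcript sequences may be a better fit for some platforms.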