I am currently working with the Clariom D Human microarray. After normalizing the data with RMA from {oligo}, I get 138745 features, which I need to annotate for downstream analysis (differential expression and gene enrichment). I would like to assign a gene identifier to each probe ID using the select() function from {AnnotationDbi}:
> all.eset <- oligo::rma(oligo::read.celfiles(myCelFiles), target = "core")
> annotation_testdata <- AnnotationDbi::select(clariomdhumantranscriptcluster.db, featureNames(all.eset), c("SYMBOL", "GENENAME", "ENTREZID", "ENSEMBL"))
Only 31575 features have gene identifiers:
> length(which(!is.na(annotation_testdata$SYMBOL)))
[1] 31575
> length(which(!is.na(annotation_testdata$GENENAME)))
[1] 31575
> length(which(!is.na(annotation_testdata$ENTREZID)))
[1] 31575
> length(which(!is.na(annotation_testdata$ENSEMBL)))
[1] 29625
> length(unique(annotation_testdata$ENTREZID))
I know that some features only serve as controls, but I would not expect that many. Is this normal?
Thank you in advance for your help
Please use the ADD COMMENT link to add further comments or questions. If you use the Add your answer box, it looks like you are answering a question, which clearly you are not.
The summarization doesn't work the way you think, really. There are probes, and Affy has collected them into groups (probe set regions, or PSRs) that are intended to measure portions of an exon, or exon-exon junctions. This is the probeset level summarization, and you only get some information about the particular PSR being interrogated, but you can then hypothetically use that information to model differential exon usage.
Affy has also defined a collection of PSRs that measure all the exons for a set of known transcripts. The expression values for this summarization level are intended to give an estimate of the transcript abundance, which may differ between transcripts for a given gene (or may not - these days Affy recycles the probes quite a bit, so two transcripts for a given gene might only differ by a couple of PSR probesets).
But at no time is there any summarization at the gene level, because there is no such thing. Genes are regions defined on a genome, which give rise to transcripts of (possibly) varying lengths, and that is what we measure. It just happens that it is easier to think about 'gene' expression because it's easier to interpret higher/lower expression for a given gene, and much more difficult to interpret a combination of higher/lower expression for the N different transcripts of that gene, not to mention accurately estimating the transcription levels for all those transcripts.
You could hypothetically summarize all the transcript level probesets to a 'gene' level, by for example computing the mean of all the transcript probesets for that gene, or maybe picking the 'best' one, or just randomly picking one transcript per gene. But that's up to you.
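As a rough illustration of the first two options, here is a minimal base R sketch on made-up data. The probeset IDs, gene symbols and expression values are all hypothetical, and 'best' is arbitrarily taken to mean the transcript probeset with the highest average expression, which is only one of several reasonable choices.

```r
# Toy transcript-level expression matrix (hypothetical IDs and values):
# rows are transcript probesets, columns are samples.
expr <- matrix(
  c(5.0, 5.2,
    7.0, 7.4,
    3.0, 3.2,
    6.0, 6.6),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("TC1", "TC2", "TC3", "TC4"), c("sample1", "sample2"))
)

# Hypothetical probeset-to-gene mapping; NA marks an unannotated probeset.
symbol <- c(TC1 = "GENEA", TC2 = "GENEA", TC3 = "GENEB", TC4 = NA)

# Drop probesets without a gene identifier before collapsing.
keep <- !is.na(symbol[rownames(expr)])
expr_annot <- expr[keep, , drop = FALSE]
genes <- symbol[rownames(expr_annot)]

# Option 1: mean of all transcript probesets that map to the same gene.
gene_sums <- rowsum(expr_annot, group = genes)
gene_mean <- gene_sums / as.vector(table(genes)[rownames(gene_sums)])

# Option 2: keep one 'best' probeset per gene
# (assumption: 'best' = highest average expression across samples).
avg <- rowMeans(expr_annot)
best_id <- vapply(
  split(names(avg), genes),
  function(ids) ids[which.max(avg[ids])],
  character(1)
)
gene_best <- expr_annot[best_id, , drop = FALSE]
rownames(gene_best) <- names(best_id)
```

With real data, the mapping would come from whatever annotation you attached to the transcript-level probesets, and you would have to decide what to do with probesets that map to more than one gene.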
So if you summarize at the probeset level, you get estimates for all the PSR and JUC probesets, and if you summarize at the core level you get estimates for all the various transcripts that Affy say are being measured. You should get lots more values if you summarize at the probeset level as compared to the transcript level.
Because of this, there are two different annotation packages: clariomdhumanprobeset.db and clariomdhumantranscriptcluster.db. If you instead use the internal Affy data that comes with the pd.clariom.d.human package (with annotateEset), you choose the level with the 'type' argument, specifying either 'core' (the default) or 'probeset'.
Hi James, thank you again for your help. So the transcript-level summary is estimated from the JUC or PSR probes?