Hello,
I am currently analyzing a microarray study which was deposited on GEO. I am using the oligo
package to obtain gene level summaries after background correction.
Now I want to map all the remaining probesets to to entrez gene ids, but I wasn't sure how to do that. I found the getProbeInfo
function which seems to be exactly what I am looking for, but was not sure which field
corresponds to entrez gene id (or ensembl gene id).
This is my code so far:
file_names <- list.files("/DDAS/MA", pattern = "CEL", full.names = T)
rawData <- read.celfiles(file_names)
geneSummaries <- rma(rawData, target="core")
probeInfo <- getProbeInfo(geneSummaries ,field= c('fid', 'fsetid', 'atom',
'transcript_cluster_id','exon_id'), target='core')
The command rownames(geneSummaries)
returns a list which looks something like this:
"2371346" "2371474" "2371547" "2371694" "2371738" "2371873" "2372103", ...
I would very much appreciated if somebody could tell me what the correct field
name is to get the gene identifiers, or any other way how I can get gene ids from geneSummaries
.
Thank you very much! This is extremly useful. Just a quick question. When I ran your command and specify
type="core"
I still get this warning:Any Idea why this might happen? The command
annotation(geneSummaries)
returns "pd.huex.1.0.st.v2".Also, just for my own understanding, would the following command return a similar annotation dataframe?
Thanks again!
You could instead do
You shouldn't use type = 'core' if you have summarized at the transcript level. Using type = 'core' implies that you summarized at the PSR (roughly exon) level, which is why there are so many more annotations than probesets.
It can be a problem to use
select
with more than one column, because under the hood it's doing a left join on those two columns, so if there are any 1 to many matches you end up with way more rows than you would expect. WhatannotateEset
does is repeated calls tomapIds
, which returns only one result per probeset ID, for each column, and then collapses to adata.frame
. That way you are ensured that the annotations match perfectly with the underlying data.Here's an extreme example:
I should note that
mapIds
might not do what you exactly want. By default it takes the first of any 1:many mappings, which may or may not be what you want. As an example, there are quite a few Genbank and RefSeq (the ACCNUM column) IDs that map to each of the Gene IDs I supplied in the above example. The first one might not be the 'right one' in some sense, and maybe you want more than that. There are ways to get all the 1:many mappings (aDataFrame
being the canonical structure for Bioconductor), but in general usage having one is enough.Thank you very much for your detailed answer and your example! It is much appreciated.
Might I ask why this is the case? I read in the
oligo
userguide that specifiyingtarget ="core"
inrma
summarized to gene level. Does this mean that "core" is something different betweenannotateEset
andoligo
. And if so, what is the "correct" target if I have usedtarget ="core"
inrma
?Thanks again!