Annotating your data that way is probably not that useful. The raw Affy annotation is, shall we say, messy. Also note that the default summarization for Gene ST arrays (and the level that IMO you should only ever summarize the Gene ST arrays) is the 'core' or 'transcript' level, and you asked for the 'probeset' level annotation, so you have a mismatch. In addition, there are any number of things on these arrays that won't have any annotation per se. The first N probesets on the array are classified as normgene->intron, which is a background control probe that is supposed to be a background because they are complementary to intronic sequences. Because Affy somehow seems to think you wouldn't get signal from introns when using random primers, which just boggles the mind, but whatever.
Anyway, don't do that. Instead, you want to use the hugene20sttranscriptcluster.db package. You most likely want to use the select() function:
library(hugene20sttranscriptcluster.db)
gns <- select(hugene20sttranscriptcluster.db, featureNames(celfiles), c("ENTREZID","SYMBOL","GENENAME"))
But do note that a.) this will return one-to-many mappings, because a given probeset may interrogate multiple things, and b.) for this release, it will do that silently - you won't get a warning, so you have to remember this! I usually do the most naive thing possible, which is
gns <- gns[!duplicated(gns[,1]),]
And then you could stuff that into an AnnotationDataFrame and put it in your ExpressionSet. If you forget to subset and try to put that into your ExpressionSet, you will get a warning, so at least there is some protection when using that route.
Thanks for the detailed reply,
first off, the reason I had probesets was that the PhD student who I'm working with (I'm doing my masters thesis) copied that line of code from his script for me to use.
secondly, this is being done with mouse genes, so I assume that I replace the hugene20st with mogene20st
trying to use the commands I get the message
Tried initially with the hugene, swapped to mogene thinking it was just wrong IDs and got same output
thanks for your help so far
Oh yeah, I forgot. MoGene...
I made the assumption (given the sparsity of your code) that you had an ExpressionSet called 'celfiles' that you generated doing something like
In which case
featureNames(celfiles)
would give you all the Affy IDs for that array. But you have evidently done something else, so you should probably give us some more code showing what the 'celfiles' object is, and any modifications you have made to it.that is what I did,
celfiles is an expression set, whereas geneName is an annotated dataframe generated using
geneName=getNetAffx(celfiles, type="probeset")
I posted that because that was the structure of the annotation I was getting at the time
as for my code, so far I have this (I've left of the graphs created as those aren’t relevant here
This part
is almost surely not what you want to be doing. Can you say why you are doing that?
Your ExpressionSet is called 'norm', so you would annotate by doing something like
I use those three annotations - see
columns(mogene20sttranscriptcluster.db
) for more choices. ThenThen
will have all the annotations included.
the reason for putting those ranges in was me trying to replicate what I'd seen using example data for the "Heatplus" tutoria (which had case/control, and coded those 0/1, having rerun without I see I dont need it to get the clustering, (though the heatmap isn't quite as petty and lacks the clear cut differentiation between them)
having run that code it works perfectly (was initially worried as the first 1000 lines shown were all NA) but I now have the names showing, amazed that its around ~14k NA probes there though.
Thank you for helping me with this
James W. MacDonald Sorry to comment on an old thread.
But I have been trying to use hugene20sttranscriptcluster.db to get entrezids for the transcript cluster ids and I get an error message which i am really not sure what it means. Following is the code:
Details of the platform:
GPL5175 [HuEx-1_0-st] Affymetrix Human Exon 1.0 ST Array [transcript (gene) version]
Read files
oligobatch=oligo::read.celfiles(filenames = file.path(myd,SDRF$ID),verbose = FALSE, phenoData = SDRF)
Normalized data
RMAmyData=oligo::rma(oligobatch, target = "core") # I assume that I have summarize the probes at transcript level
gene symbols using getNetAffy
featureData(RMAmyData)= getNetAffx(RMAmyData, "transcript") # This works but I would like to have entrezids and getNetAffx doesn't annotate them.
Hence Annotation using hugene20sttranscriptcluster.db for entrezids
Input file for annotation looks like this
head(featureNames(RMAmyData)) [1] "2315554" "2315633" "2315674" "2315739" "2315894" "2315918" # I assume that these are transcript ids. Not sure though!
Annotation
EntrezIDmyData <- AnnotationDbi::select(hugene20sttranscriptcluster.db, featureNames(RMAmyData), c("ENTREZID","SYMBOL"))
Error in .testForValidKeys(x, keys, keytype, fks) : None of the keys entered are valid keys for 'PROBEID'. Please use the keys method to see a listing of valid arguments.
I am confused with how the error reads. Does that mean that my data has not been summarized from exon-level to transcript level? Or it's just a generic error and where I should just focus on trouble shooting annotation step using 'hugene20sttranscriptcluster.db'. If yes, where am I going wrong?
I would really appreciate any response. Thanks.
Don't comment on old posts. Make a new one instead