Dear ALL,
i have an expression dataset(called data.trusted.eset)(affymetrix, hgu133a) which i have preprocessed and filtered and which i used as the input for my limma paired analysis to find DE genes. My basic question is, that i want to use the same expressionSet an an input(exprs(data.trusted.eset)) as a basic argument to a gene network co-regulation analysis,----but instead of probesets as rownames i have to replace them with unique gene symbols. Thus, remove in an appropriate way remove NAs but also duplicates.
I tried:
affyUniverse <- featureNames(data.trusted.eset)
symbols <- hgu133aSYMBOL[affyUniverse]
symbol.Universe <- unique(as.character(symbols))
My main problem is then how to reduce my expression set with these symbols. Or my approach is totally wrong ?
Any suggestions or ideas would be grateful
You have to get out of the habit of using things like mget() for accessing data from the annotation packages. By default these old methods return NA for any probeset with one-to-many mappings. Use select() instead, as I have already shown.
Otherwise I think your code is OK.
thank you again for your notification- i have read from a vignette about mget(http://www.bioconductor.org/packages/release/bioc/vignettes/GOstats/inst/doc/GOstatsHyperG.pdf) thats why i have used it
Dear Mr MacDonald,
please excuse me for posting one more question for the above subject, but i have noticed something very important and i would like to elusidate a specific issue(so please excuse me again for any "naive" questions). More specifically, after your above function:
i noticed that except from the usual case of different probesets matching to the same Gene Symbol(transcript), i have also noticed several probesets, each of them (could) match to more than one gene symbols. Thus, my main concern is that with
might we (choose) arbitrary select a symbol for each probeset(which is the first incidence with !duplicated function), which might also includes non-coding genes in some circumstances ? Could for instance a probeset might be affected to coding or non-coding genes ? And also this although includes more annotated probesets than mget, affect in a different way my downstream analysis?
Thank you in advance
You would have to check this to be sure, but I am not sure that there was any non-coding content on the old HG-U133A array. Note that these arrays were based on UniGene build 133. I am not an expert on the various NCBI databases, but it seems to me that UniGene is intended to be for genes that are translated to protein. In addition, that build is from like, 2003 or so, and I don't think there was much known untranslated content.
Ι understand . I have so far checked it manually. For instance, for the probeset
265 200715_x_at RPL13A
266 200715_x_at SNORD35A
the second match if im mistaken it is a non-coding gene. Thus far, maybe it seems random but for a lot of these probesets the first match(which also is chosen from the !duplicated()) is a coding gene and also for instance
4637 210483_at LOC254896
4638 210483_at TNFRSF10C
the first match is a RNA gene(non-coding)
Thus, if there is a small number after the removal of duplicate probesets which might match to untranslated transcripts, should i keep them in order not to use generally information by removing them as NAs with mget ? The difference between the code i posted and your code is about 200 probesets, but it might be significant
This is the problem with the one-to-many probesets; it's not clear how to select the 'best' gene for a given probeset, and it is certainly not easy to do this in an automated way.
Ok thank you again for your consideration. I would consider both approaches and i will finally choose one of them. I hope than the small amount of these probesets wouldnt affect too much the downstream analysis, as you mentioned there isnt a "gold standard" for the most powerful approach