Hi everybody.
I am running into some problems while trying to download and parse a
dataset of methylation values. I do not think these are technical
problems with the tools: GEOquery works perfectly and makes getting
this kind of data an easy task. However, I do not fully understand
the lifecycle of GEO series data, so I would like to ask this list for
any hints on the behavior described below, so that I can try to fix it.
The first thing I did was download and parse the desired GSE data file,
with the default value of the GSEMatrix parameter (TRUE). I then
extracted the ExpressionSet and the assayData I was looking for.
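library(GEOquery)  # also attaches Biobase, which provides exprs() and featureNames()
# GSEMatrix defaults to TRUE, so this returns a list of ExpressionSets
my.gse <- getGEO('GSE30870', destdir='/Users/gbayon/Documents/GEO/')
my.expr.set <- my.gse[[1]]
beta.values <- exprs(my.expr.set)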
What surprised me at first was to see many strange values (all
containing the 'NA' string) in the featureNames of the expression
set.
> head(featureNames(my.expr.set), n=20)
[1] "NA" "cg00000108" "cg00000109" "cg00000165" "NA.1" "NA.2" "NA.3"
[8] "NA.4" "cg00000363" "NA.5" "NA.6" "NA.7" "NA.8" "cg00000734"
[15] "NA.9" "cg00000807" "cg00000884" "NA.10" "NA.11" "NA.12"
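To put a number on this, here is a minimal sketch that counts the placeholder names. The vector is simply the 20 names printed above, copied by hand so the snippet runs without downloading anything; `feature.names` and `is.placeholder` are names I made up for illustration. On the real data, `featureNames(my.expr.set)` would take the place of the hand-copied vector.

```r
# The 20 feature names shown in the head() output above, copied by hand
feature.names <- c("NA", "cg00000108", "cg00000109", "cg00000165", "NA.1",
                   "NA.2", "NA.3", "NA.4", "cg00000363", "NA.5",
                   "NA.6", "NA.7", "NA.8", "cg00000734", "NA.9",
                   "cg00000807", "cg00000884", "NA.10", "NA.11", "NA.12")

# Flag names that are exactly "NA" or "NA.<number>" (illustrative pattern)
is.placeholder <- grepl("^NA(\\.[0-9]+)?$", feature.names)

sum(is.placeholder)              # how many names look like placeholders
feature.names[!is.placeholder]   # the names that look like real probe IDs
```

In this sample, 13 of the 20 names are placeholders and only 7 look like real probe identifiers.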
If I select an individual GSM in the series and download it, the
featureNames are fine. If I download the GSE with GSEMatrix=FALSE, I
get a list of GSM data sets, and the results are again good. This made
me suspect the intermediate, pre-parsed matrix form. I have not found
any information about the lifecycle of this kind of data, that is, how
the series matrix is built. Is it a manual process? Is it automatic?
If it is a manual process, then I guess I will have to contact whoever
uploaded the data to see if they can fix it. But if it is not, I would
like to know whether this is something related to BioC or, more
plausibly, to GEO.
Any help would be appreciated.
Regards,
Gustavo