GEOquery, GSEMatrix parameter and lifecycle of GEO series data

Gustavo Fernández Bayón ▴ 440

@gustavo-fernandez-bayon-5300

Last seen 9.3 years ago

Spain

Hi everybody.

I am running into quite a few problems while trying to download and parse a dataset of methylation values. These are not technical problems, IMHO: GEOquery works perfectly and makes getting this kind of data an easy task. However, I do not think I fully understand the lifecycle of GEO series data, and I would like to ask the list for any hints on this behavior so that I can try to fix it.

First, I downloaded and parsed the desired GSE data file with the default value of the GSEMatrix parameter (TRUE). Then I extracted the ExpressionSet and the assayData I was looking for:

    my.gse <- getGEO('GSE30870', destdir='/Users/gbayon/Documents/GEO/')
    my.expr.set <- my.gse[[1]]
    beta.values <- exprs(my.expr.set)

What really surprised me was seeing many strange values (all containing the string 'NA') in the featureNames of the expression set:

    > head(featureNames(my.expr.set), n=20)
     [1] "NA"         "cg00000108" "cg00000109" "cg00000165" "NA.1"       "NA.2"       "NA.3"
     [8] "NA.4"       "cg00000363" "NA.5"       "NA.6"       "NA.7"       "NA.8"       "cg00000734"
    [15] "NA.9"       "cg00000807" "cg00000884" "NA.10"      "NA.11"      "NA.12"

If I select an individual GSM in the series and download it, the featureNames are fine. If I download the GSE with GSEMatrix=FALSE, I get a list of GSM data sets, and the result is again good. This made me suspect the intermediate, pre-parsed matrix form. I have not found any information about the lifecycle of this kind of data, that is, how the matrix is built. Is it a manual process? Is it automatic?

If it is a manual process, then I guess I will have to contact the person responsible for uploading the data to see if they can fix it. If it is not, I would like to know whether this is something related to BioC or, more plausibly, to GEO.

Any help would be appreciated.

Regards,
Gustavo
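One way to quantify the symptom described in this post is to flag feature names that look like placeholder row names ("NA", "NA.1", ...). The following is a minimal base-R sketch, not part of GEOquery; the helper name `is_placeholder_id` is hypothetical, and the vector simply repeats the 20 names printed in the post.

```r
# Hypothetical helper: TRUE for names such as "NA", "NA.1", "NA.12".
is_placeholder_id <- function(ids) {
  grepl("^NA(\\.[0-9]+)?$", ids)
}

# The 20 feature names shown in the post.
ids <- c("NA", "cg00000108", "cg00000109", "cg00000165", "NA.1", "NA.2", "NA.3",
         "NA.4", "cg00000363", "NA.5", "NA.6", "NA.7", "NA.8", "cg00000734",
         "NA.9", "cg00000807", "cg00000884", "NA.10", "NA.11", "NA.12")

flagged <- is_placeholder_id(ids)
sum(flagged)     # how many names are placeholders
ids[!flagged]    # probe IDs that kept their real name
```

On a real object, the same helper could be applied to `featureNames(my.expr.set)` to see what fraction of the probes lost their identifiers.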

PROcess GEOquery • 1.9k views

ADD COMMENT • link 12.8 years ago Gustavo Fernández Bayón ▴ 440

Hi again.

I would like to add a bit more information on this issue. I have been debugging inside the parseGSEMatrix() function in the GEOquery source code. The suspicious NAs appear when execution reaches the following line:

    ## Apparently, NCBI GEO uses case-insensitive matching
    ## between platform IDs and series ID Refs ???
    dat <- dat[match(tolower(rownames(datamat)),tolower(rownames(dat))),]

The problem here is that 'datamat' has the correct number of rows, around 480K, but 'dat' does not. At a glance, 'datamat' comes from the series matrix file while 'dat' comes from the GPL.

If you go to the GEO page of that GPL (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?token=djaxxiayqmwyspu&acc=GPL13534), you will find it says that the GPL description table has exactly 485577 rows, which makes sense: one description for each probeset. However, inside the code, 'dat' has only 143889 rows.

Replicating directly from the R console:

    > gpl <- getGEO('GPL13534',destdir='../../GEO/')
    > Meta(gpl)$data_row_count
    [1] "485577"
    > t <- Table(gpl)
    > dim(t)
    [1] 143889     37

I was really surprised to find this, and I do not know enough to tell whether it is due to some constraint I am unaware of. Is that ok? Or is there any bug in the GPL processing code? Now I'm going home, but I'll keep debugging to see what is really happening inside.

Any help will be very much appreciated.

Regards,
Gus
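The effect of the quoted `match()` line on an incomplete platform table can be reproduced with toy data. This is a sketch under stated assumptions: `datamat` and `dat` below are small made-up stand-ins (the `note` column and the sample names are invented), not the real series matrix or GPL13534 table.

```r
# Toy stand-in for the series matrix: six probes, two samples.
probe_ids <- c("cg00000029", "cg00000108", "cg00000109",
               "cg00000165", "cg00000236", "cg00000289")
datamat <- matrix(seq_len(12) / 12, nrow = 6,
                  dimnames = list(probe_ids, c("sample1", "sample2")))

# Toy stand-in for a truncated platform table: only three probes are present.
kept <- c("cg00000108", "cg00000109", "cg00000165")
dat <- data.frame(ID = kept, note = c("a", "b", "c"),
                  row.names = kept, stringsAsFactors = FALSE)

# Same alignment step as the line quoted from parseGSEMatrix().
idx <- match(tolower(rownames(datamat)), tolower(rownames(dat)))
aligned <- dat[idx, ]

idx                 # NA wherever a probe is missing from 'dat'
rownames(aligned)   # "NA" "cg00000108" "cg00000109" "cg00000165" "NA.1" "NA.2"
```

Indexing a data frame with `NA` yields rows named "NA", "NA.1", and so on, which matches the pattern seen in the featureNames. So a platform table with missing rows is enough to explain the symptom, whatever the reason the rows are missing.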

ADD COMMENT • link 12.8 years ago Gustavo Fernández Bayón ▴ 440

On Wed, Jun 27, 2012 at 11:38 AM, Gustavo Fernández Bayón wrote:
> [...] Is that ok? Or is there any bug in the GPL processing code?

This is most likely a bug in GPL parsing. There are A LOT of edge cases that I have tried to deal with, some not very appropriately. Often, the error is due to an extraneous quote in an unexpected location. I'll look into this one. Could you do me a favor and send along sessionInfo() just so I know?

Thanks,
Sean
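The general failure mode Sean mentions, a stray quote character swallowing rows of a delimited table, can be illustrated with base R alone. This is only a sketch of the mechanism on invented data; it does not show GEOquery's actual parsing code, and it is not a diagnosis of what happened with GPL13534.

```r
# Toy tab-delimited annotation text (hypothetical, not taken from GEO).
lines <- c("ID\tDescription",
           "p1\talpha",
           "p2\t5' flank",
           "p3\tgamma",
           "p4\t3' flank",
           "p5\tepsilon")
txt <- paste(lines, collapse = "\n")

# Quote handling enabled: the apostrophes are treated as string delimiters,
# so text between them (including line breaks) is merged into one field.
naive <- suppressWarnings(
  read.delim(text = txt, quote = "\"'", stringsAsFactors = FALSE)
)

# Quote handling disabled: every physical line becomes one row.
safe <- read.delim(text = txt, quote = "", stringsAsFactors = FALSE)

nrow(naive)   # fewer rows than the five data lines in the input
nrow(safe)    # 5
```

Comparing the parsed row count against the number of data lines, as done here with `nrow(safe)`, is a cheap way to detect this kind of silent truncation.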

ADD REPLY • link 12.8 years ago Sean Davis 21k

Dear Sean and Gustavo,

I cannot reproduce this error. See below.

    library(GEOquery)
    my.gse <- getGEO('GSE30870', destdir=".")
    featureNames(my.gse[[1]])[1:10]
    # [1] "cg00000029" "cg00000108" "cg00000109" "cg00000165" "cg00000236"
    # [6] "cg00000289" "cg00000292" "cg00000321" "cg00000363" "cg00000622"
    all(featureNames(my.gse[[1]]) == rownames(exprs(my.gse[[1]])))
    # [1] TRUE
    gpl <- getGEO('GPL13534',destdir=".")
    Meta(gpl)$data_row_count == nrow(Table(gpl))
    # [1] TRUE

    > sessionInfo()
    R version 2.15.1 (2012-06-22)
    Platform: x86_64-pc-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
     [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
     [7] LC_PAPER=C                 LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  tools     methods
    [8] base

    other attached packages:
    [1] GEOquery_2.23.5    Biobase_2.16.0     BiocGenerics_0.2.0

    loaded via a namespace (and not attached):
    [1] RCurl_1.91-1 XML_3.9-4

HTH,
J.
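The row-count comparison used in this reply can be wrapped in a small helper that also reports how many rows are missing. This is a sketch with a hypothetical function name, `check_row_count`, exercised on toy data; it assumes the declared count arrives as a character string, as in the `Meta(gpl)$data_row_count` output shown earlier in the thread.

```r
# Hypothetical helper: compare the row count declared in the metadata
# with the number of rows actually parsed into the table.
check_row_count <- function(declared, tab) {
  declared <- as.integer(declared)   # the declared count may be a string
  parsed <- nrow(tab)
  list(declared = declared,
       parsed   = parsed,
       missing  = declared - parsed,
       complete = declared == parsed)
}

# Toy table with three rows, checked against a declared count of five.
toy <- data.frame(ID = c("cg00000108", "cg00000109", "cg00000165"),
                  stringsAsFactors = FALSE)
res <- check_row_count("5", toy)
res$missing    # 2
res$complete   # FALSE
```

With GEOquery loaded and a platform object `gpl` as in the thread, the call would be `check_row_count(Meta(gpl)$data_row_count, Table(gpl))`.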

ADD REPLY • link 12.8 years ago James F. Reid ▴ 120
